Re: Recommendations for a schema-based data language for use in Hadoop?
FWIW, we use edn (serialized with nippy [1]) in Hadoop, and it works very well for us: https://github.com/Netflix/PigPen

In some places we use maps for the expressiveness, and in some we use vectors for more performance. Whatever I lose in raw performance I can trivially make up by throwing a few more boxes at it, so it's a non-issue for us. The flexibility of edn outweighs any performance gains from converting back and forth to another format and having to worry about translation errors.

-Matt

[1] https://github.com/ptaoussanis/nippy

On Tuesday, August 4, 2015 at 7:05 PM, Ryan Schmitt wrote:

Hi Clojure people,

I'm currently working on some problems in the big data space, and I'm more or less starting from scratch with the Hadoop ecosystem. I was looking at ways to work with data in Hadoop, and I realized that (because of how InputFormat splitting works) this is a use case where it's actually pretty important to use a data language with an external schema. This probably means ruling out Edn (for performance and space efficiency reasons) and Fressian (managing the Fressian caching domain seems like it could get complicated), which are my default solutions for everything, so now I'm back to the drawing board. I'd rather not use something braindead like JSON or CSV.

It seems like there are a few language-agnostic data languages that are popular in this space, such as:

* Thrift
* Protobuf
* Avro

But since the Clojure community has very high standards for data languages, as well as a number of different libraries that run code on Hadoop, I was wondering if anyone could provide a recommendation for a fast, extensible, and well-designed data language to use. (Recommendations of what to avoid are also welcome.)

--
You received this message because you are subscribed to the Google Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
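To make the maps-versus-vectors trade-off mentioned above concrete, here is a minimal sketch in Python using only the standard library. JSON stands in for edn and nippy purely for illustration, and the field names and rows are made up. It only shows that positional records drop the repeated field names, at the cost of needing the field order from somewhere outside the data.

```python
import json

# Hypothetical field names, for illustration only.
FIELDS = ("user_id", "country", "clicks")

def as_map(row):
    """Self-describing form: every record repeats its field names."""
    return dict(zip(FIELDS, row))

def as_vector(row):
    """Positional form: field names live only in FIELDS, outside the data."""
    return list(row)

rows = [(1001, "US", 7), (1002, "DE", 3), (1003, "JP", 12)]

map_bytes = json.dumps([as_map(r) for r in rows]).encode("utf-8")
vec_bytes = json.dumps([as_vector(r) for r in rows]).encode("utf-8")

# The vector form is smaller because the keys are not repeated per record.
print(len(map_bytes), len(vec_bytes))

# Reading the vector form back requires knowing FIELDS out of band.
decoded = [dict(zip(FIELDS, v)) for v in json.loads(vec_bytes)]
```

The map form can gain or lose keys per record without breaking readers, which is the expressiveness side of the trade-off.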
Re: Recommendations for a schema-based data language for use in Hadoop?
Ryan Schmitt rschm...@u.rochester.edu writes:

I'm currently working on some problems in the big data space, and I'm more or less starting from scratch with the Hadoop ecosystem. I was looking at ways to work with data in Hadoop, and I realized that (because of how InputFormat splitting works) this is a use case where it's actually pretty important to use a data language with an external schema.

At Damballa we extensively use Avro for these sorts of problems. We’ve written a set of Clojure bindings for Avro named “abracad” [1]. Abracad exposes Avro data as native Clojure data (persistent vectors, maps, etc), supports protocol-based de/serialization of custom types, and includes explicit support for defining “EDN-in-Avro” schemas which can include arbitrary Clojure data.

We’ve implemented support in the mainline Java Avro project (merged in 1.7.5) for specifying configurable “data models” for MapReduce jobs, which allows Avro MapReduce input to directly produce Clojure data and output to consume Clojure data. And we’ve implemented fairly automatic configuration for such in the Avro dseqs of our “parkour” Clojure-Hadoop/MR integration library [2].

[1] https://github.com/damballa/abracad
[2] https://github.com/damballa/parkour

--
Marshall Bockrath-Vandegrift llas...@damballa.com
Principal Software Engineer, Damballa R&D
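The general idea of carrying free-form data inside a schema'd record can be sketched roughly as follows. This is not how abracad or Avro implement it; it is a standard-library Python illustration in which a fixed binary layout holds two typed fields and a length-prefixed blob, with JSON standing in for edn. The layout and the field names (`id`, `score`, `extra`) are assumptions made up for the example.

```python
import json
import struct

# Hypothetical schema: int64 id, float64 score, uint32 byte length of "extra".
HEADER = struct.Struct(">qdI")

def encode(record):
    """Pack the typed fields, then append the free-form field as a blob."""
    extra = json.dumps(record["extra"], sort_keys=True).encode("utf-8")
    return HEADER.pack(record["id"], record["score"], len(extra)) + extra

def decode(buf, offset=0):
    """Return (record, next_offset) for the record starting at offset."""
    rec_id, score, n = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    extra = json.loads(buf[start:start + n].decode("utf-8"))
    return {"id": rec_id, "score": score, "extra": extra}, start + n

rec = {"id": 42, "score": 0.5, "extra": {"tags": ["a", "b"], "nested": {"k": 1}}}
data = encode(rec) + encode(rec)

first, pos = decode(data)
second, end = decode(data, pos)
```

The typed fields stay compact and checkable against the schema, while the blob keeps room for data whose shape is not known in advance.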
Recommendations for a schema-based data language for use in Hadoop?
I suggest using Prismatic's schema library and generating Kryo serializers for your schematized records at compile time. These serializations can be very compact by leveraging the schemas, and Kryo is very fast. I've been having success with this approach on Apache Spark. If you aren't married to using Hadoop and you want performance, I suggest you investigate Spark as well.

I'm planning to extract this automatic schema-based Kryo serializer macro junk and release a lib ... when I get around to it. I'd be glad to share the code if you want.
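As a rough illustration of deriving a serializer from a schema ahead of time, here is a minimal Python sketch. It does not use Kryo or Prismatic's schema library and is not the macro described above; the type names, the `make_codec` helper, and the example schema are all hypothetical. The point is only that the schema is turned into a fixed binary layout once, so each record carries values and no field names.

```python
import struct

# Hypothetical schema type names mapped onto struct format codes.
TYPE_CODES = {"int32": "i", "int64": "q", "float64": "d", "bool": "?"}

def make_codec(schema):
    """Build encode/decode functions once from a list of (name, type) pairs."""
    names = [name for name, _ in schema]
    packer = struct.Struct(">" + "".join(TYPE_CODES[t] for _, t in schema))

    def encode(record):
        # Values are written in schema order; names are never serialized.
        return packer.pack(*(record[n] for n in names))

    def decode(data):
        return dict(zip(names, packer.unpack(data)))

    return encode, decode, packer.size

schema = [("user_id", "int64"), ("clicks", "int32"),
          ("score", "float64"), ("active", "bool")]
encode, decode, size = make_codec(schema)

rec = {"user_id": 1001, "clicks": 7, "score": 0.25, "active": True}
blob = encode(rec)
```

Here `size` is fixed by the schema alone, which is what makes the encoding compact and cheap to read.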
Recommendations for a schema-based data language for use in Hadoop?
Hi Clojure people,

I'm currently working on some problems in the big data space, and I'm more or less starting from scratch with the Hadoop ecosystem. I was looking at ways to work with data in Hadoop, and I realized that (because of how InputFormat splitting works) this is a use case where it's actually pretty important to use a data language with an external schema. This probably means ruling out Edn (for performance and space efficiency reasons) and Fressian (managing the Fressian caching domain seems like it could get complicated), which are my default solutions for everything, so now I'm back to the drawing board. I'd rather not use something braindead like JSON or CSV.

It seems like there are a few language-agnostic data languages that are popular in this space, such as:

* Thrift
* Protobuf
* Avro

But since the Clojure community has very high standards for data languages, as well as a number of different libraries that run code on Hadoop, I was wondering if anyone could provide a recommendation for a fast, extensible, and well-designed data language to use. (Recommendations of what to avoid are also welcome.)
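The point about InputFormat splitting can be illustrated with a small sketch: when the record layout is known from an external schema, a reader can start at an arbitrary byte offset and decode its share of the file without any state from earlier bytes. The Python below is a simplified assumption-laden example, not Hadoop code: it uses fixed-width records and a made-up two-field schema, whereas real formats with variable-length records (Avro container files, for instance) rely on sync markers to find record boundaries.

```python
import struct

# Hypothetical external schema: int64 id, int32 count (12 bytes per record).
RECORD = struct.Struct(">qi")

def write_records(rows):
    return b"".join(RECORD.pack(*row) for row in rows)

def read_split(data, start, end):
    """Decode the records whose first byte falls in [start, end)."""
    size = RECORD.size
    # Skip forward to the first record boundary at or after `start`.
    pos = -(-start // size) * size
    out = []
    # A record that starts inside the split is read in full,
    # even if it runs past `end`.
    while pos < end and pos + size <= len(data):
        out.append(RECORD.unpack_from(data, pos))
        pos += size
    return out

rows = [(i, i * 10) for i in range(10)]
data = write_records(rows)

# Two splits cut at an offset that is not a record boundary.
split_a = read_split(data, 0, 50)
split_b = read_split(data, 50, len(data))
```

Each record is read by exactly one split, and neither reader needs anything from the other, which is the property a stateful cache such as Fressian's makes harder to guarantee.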