Re: Recommendations for a schema-based data language for use in Hadoop?

2015-08-05 Thread 'Matt Bossenbroek' via Clojure
FWIW, we use edn (serialized with nippy [1]) in Hadoop and it works very well
for us:

https://github.com/Netflix/PigPen 

In some places we use maps for the expressiveness and in some we use vectors 
for more performance.
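
As a rough illustration of that trade-off (a sketch only, written in Python
with JSON from the standard library standing in for edn/nippy; the record and
its field names are made up): a map repeats its keys in every record, while a
vector relies on an agreed field order that lives outside the data.

```python
import json

# Hypothetical record; the field names are illustrative only.
as_map = {"user_id": 42, "country": "US", "clicks": 7}

# The same values as a positional vector. The field order is an external
# convention that both the writer and the reader must know.
FIELDS = ("user_id", "country", "clicks")
as_vector = [as_map[name] for name in FIELDS]

map_bytes = json.dumps(as_map, separators=(",", ":")).encode("utf-8")
vec_bytes = json.dumps(as_vector, separators=(",", ":")).encode("utf-8")

# The map form carries every key in every record; the vector form does not.
print(len(map_bytes), len(vec_bytes))

# Reading the vector back into a map requires knowing FIELDS.
restored = dict(zip(FIELDS, json.loads(vec_bytes)))
```

The vector form is smaller and cheaper to parse, but a reordered or missing
field fails silently, which is the expressiveness being given up.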

Whatever I lose in raw performance I can trivially make up by throwing a few
more boxes at the problem, so it's a non-issue for us. The flexibility of edn
outweighs any performance gains from converting back and forth to another
format and having to worry about translation errors.

-Matt

[1] https://github.com/ptaoussanis/nippy



-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Recommendations for a schema-based data language for use in Hadoop?

2015-08-05 Thread Marshall Bockrath-Vandegrift
Ryan Schmitt <rschm...@u.rochester.edu> writes:

> I'm currently working on some problems in the big data space, and I'm
> more or less starting from scratch with the Hadoop ecosystem. I was
> looking at ways to work with data in Hadoop, and I realized that
> (because of how InputFormat splitting works) this is a use case where
> it's actually pretty important to use a data language with an external
> schema.

At Damballa we extensively use Avro for these sorts of problems.  We’ve
written a set of Clojure bindings for Avro named “abracad” [1].  Abracad
exposes Avro data as native Clojure data (persistent vectors, maps,
etc), supports protocol-based de/serialization of custom types, and
includes explicit support for defining “EDN-in-Avro” schemas which can
include arbitrary Clojure data.

We’ve implemented support in the mainline Java Avro project (merged in
1.7.5) for specifying configurable “data models” for MapReduce jobs,
which allows Avro MapReduce input to directly produce Clojure data and
output to consume Clojure data.  And we’ve implemented fairly automatic
configuration for such in the Avro dseqs of our “parkour”
Clojure-Hadoop/MR integration library [2].

[1] https://github.com/damballa/abracad

[2] https://github.com/damballa/parkour

-- 
Marshall Bockrath-Vandegrift <llas...@damballa.com>
Principal Software Engineer, Damballa R&D



Recommendations for a schema-based data language for use in Hadoop?

2015-08-05 Thread Blake Miller
I suggest using Prismatic's Schema library, and generating Kryo serializers for
your schematized records at compile time. These serializations can be very
compact by leveraging the schemas, and Kryo is very fast. I've been having
success with this approach on Apache Spark. If you aren't married to using
Hadoop, and you want performance, I suggest you investigate Spark as well.

I'm planning to extract this automatic schema-based Kryo serializer macro junk
and release a lib ... when I get around to it. I'd be glad to share the code if
you want.
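
To make the idea of schema-driven serialization concrete, here is a minimal
sketch. It is not the Kryo or Prismatic Schema API, and it is not the macro
code mentioned in this message: it uses Python's standard `struct` module as a
stand-in, and the schema, field names and `make_codec` helper are hypothetical.
The point it shows is that a codec built once from the schema can write records
with no field names or type tags in the payload.

```python
import struct

# Hypothetical schema: (field name, struct format code).
# q = signed 64-bit int, d = 64-bit float, ? = bool.
SCHEMA = (("user_id", "q"), ("score", "d"), ("active", "?"))


def make_codec(schema):
    """Build encode/decode functions once from the schema."""
    names = [name for name, _ in schema]
    # "<" selects little-endian layout with no padding between fields.
    packer = struct.Struct("<" + "".join(code for _, code in schema))

    def encode(record):
        # Only the values are written; the schema supplies names and types.
        return packer.pack(*(record[name] for name in names))

    def decode(data):
        return dict(zip(names, packer.unpack(data)))

    return encode, decode, packer.size


encode, decode, size = make_codec(SCHEMA)
record = {"user_id": 42, "score": 0.5, "active": True}
payload = encode(record)
print(size, decode(payload) == record)
```

Because every record has the same fixed layout, the reader needs the schema to
interpret the bytes, which is the cost of the compactness.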



Recommendations for a schema-based data language for use in Hadoop?

2015-08-04 Thread Ryan Schmitt
Hi Clojure people,

I'm currently working on some problems in the big data space, and I'm more 
or less starting from scratch with the Hadoop ecosystem. I was looking at 
ways to work with data in Hadoop, and I realized that (because of how 
InputFormat splitting works) this is a use case where it's actually pretty 
important to use a data language with an external schema. This probably 
means ruling out Edn (for performance and space efficiency reasons) and 
Fressian (managing the Fressian caching domain seems like it could get 
complicated), which are my default solutions for everything, so now I'm 
back to the drawing board. I'd rather not use something braindead like JSON 
or CSV.

It seems like there are a few language-agnostic data languages that are 
popular in this space, such as:

* Thrift
* Protobuf
* Avro

But since the Clojure community has very high standards for data languages, 
as well as a number of different libraries that run code on Hadoop, I was 
wondering if anyone could provide a recommendation for a fast, extensible, 
and well-designed data language to use. (Recommendations of what to avoid 
are also welcome.)
