Hey Josh,

Thanks for the tips!
I followed HBaseSource.java when implementing the ESSource and copied its inputId handling approach:
https://github.com/tzolov/elasticsearch-hadoop/blob/master/src/main/java/org/elasticsearch/hadoop/crunch/ESSource.java

I don't completely understand the implications of the dummy Path parameter. In this context, is the Path needed only for the input equality check?

The ESTarget is trickier. I wasn't sure what to do with the keyClass parameter in CrunchOutputs.addNamedOutput(), so I've set it to String. ES-Hadoop uses Jackson for JSON serialization, and it fails when trying to serialize internal Crunch Writable types, I guess because they are not public. Storing internal Crunch Writable types in ES doesn't make much sense anyway. The current implementation expects a custom (Writable) class to define the JSON format. Perhaps with Avro we can try to reuse the Avro schema.

Here is the ES-Hadoop ticket for adding Crunch support to the ES-Hadoop project:
https://github.com/elasticsearch/elasticsearch-hadoop/issues/20

Shall we deploy the 0.6.0-SNAPSHOT to some public snapshot repo? https://repository.apache.org/content/groups/snapshots/org/apache/crunch/ is empty. Perhaps we can deploy the latest Jenkins builds into this snapshot repo, unless there is some policy against it?

Cheers,
Chris

On Mon, Apr 8, 2013 at 7:18 AM, Josh Wills <[email protected]> wrote:

> Hey Christian,
>
> Super-cool. Replies inlined.
>
> On Sun, Apr 7, 2013 at 8:32 PM, Christian Tzolov <[email protected]> wrote:
>
> > I've been working on Crunch - ElasticSearch (http://www.elasticsearch.org/)
> > integration over the weekend :)
> >
> > Here is my first prototype:
> > https://github.com/tzolov/elasticsearch-hadoop#crunch and a sample
> > application: http://bit.ly/Y7lasW.
> >
> > It implements an ES Source and Target on top of ES-Hadoop's
> > (https://github.com/elasticsearch/elasticsearch-hadoop) ESInputFormat
> > and ESOutputFormat.
> >
> > Not sure though what is the best/right way to build Sources/Targets for
> > new Input/Output Formats? Any suggestions, references?
> >
>
> I built a Source for HCatalog last week as part of ML:
> https://github.com/cloudera/ml/blob/master/hcatalog/src/main/java/com/cloudera/science/ml/hcatalog/HCatalogSource.java
>
> The interesting bit is really in the configureSource method: if the inputId
> is < 0, then it's a single-input MapReduce job, and you can essentially
> configure the input just as you would for a regular MapReduce job. If the
> inputId >= 0, then it's a multi-input job (e.g., for a join), and you have
> to use CrunchInputs with a FormatBundle object. The FormatBundle wraps an
> InputFormat or an OutputFormat with any Configuration settings that the
> InputFormat/OutputFormat needs. This way, you can have multiple inputs that
> use the same InputFormat but different configuration settings (e.g., when
> you're joining multiple Avro files together and each needs its own schema
> specified).
>
> > The write to ES is tricky and at the moment looks more like a hack (see
> > the doc).
> >
> > Cheers
> > Chris
> >
> > (P.S. The prototype doesn't support AvroTypeFamily yet, but I've been
> > looking at a jackson-dataformat-avro kind of solution (ES-Hadoop relies
> > on Jackson for the JSON serialisation).)
>
> I'd like to work on this as well -- I'll take a look tomorrow and try to
> put together a pull req for anything that I think should be configured
> differently.
>
> J
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
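[For the archives: the inputId branching Josh describes might look roughly like this inside a custom Source's configureSource. This is a sketch against the Crunch 0.5/0.6-era API, not the actual ESSource code; the "es.query" property name and the dummy "/es" Path are illustrative placeholders, and the exact ESInputFormat class location should be checked against the ES-Hadoop jar.]

```java
import org.apache.crunch.io.CrunchInputs;
import org.apache.crunch.io.FormatBundle;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.elasticsearch.hadoop.mr.ESInputFormat;

public void configureSource(Job job, int inputId) {
  if (inputId < 0) {
    // Single-input job: configure the InputFormat directly on the Job,
    // just as you would for a plain MapReduce job.
    job.setInputFormatClass(ESInputFormat.class);
    job.getConfiguration().set("es.query", query);  // illustrative key
  } else {
    // Multi-input job (e.g., a join): wrap the InputFormat and its
    // settings in a FormatBundle, so several inputs can share the same
    // InputFormat class while carrying different configurations.
    FormatBundle<ESInputFormat> bundle =
        FormatBundle.forInput(ESInputFormat.class);
    bundle.set("es.query", query);
    // The dummy Path seems to serve mainly as the per-input identity key.
    CrunchInputs.addInputPath(job, new Path("/es"), bundle, inputId);
  }
}
```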

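[And for the Target side, a sketch of what I did with CrunchOutputs.addNamedOutput and the String keyClass mentioned above. Again a rough outline against the Crunch/Hadoop APIs, not the committed ESTarget: the named-output branch, the MapWritable value class, and the method shape are assumptions to be checked against the actual code.]

```java
import org.apache.crunch.io.CrunchOutputs;
import org.apache.crunch.io.FormatBundle;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.mapreduce.Job;
import org.elasticsearch.hadoop.mr.ESOutputFormat;

public void configureForMapReduce(Job job, String name) {
  if (name == null) {
    // Single-output job: set the OutputFormat directly.
    job.setOutputFormatClass(ESOutputFormat.class);
  } else {
    // Multi-output job: register a named output. It's unclear what the
    // right keyClass is here, since ES-Hadoop ignores the key; String
    // is the placeholder choice discussed above.
    FormatBundle<ESOutputFormat> bundle =
        FormatBundle.forOutput(ESOutputFormat.class);
    CrunchOutputs.addNamedOutput(job, name, bundle,
        String.class, MapWritable.class);
  }
}
```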