I'm optimistic it will. Re: the MemPipeline, I think it almost always
defaults to writing text output, on the assumption that when you're using
the MemPipeline you're debugging stuff, and text is the fastest way to do
that. But that may not be the right thing to do in all cases-- if anyone
has any feedback on this, we'd be grateful for it.
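
For anyone wondering why binary output matters here, below is a rough sketch of the size difference between a JSON rendering and Avro's binary datum encoding. It is hand-rolled with only the JDK and does not use the Avro or Crunch libraries. The class name, the two-field record (a long `id` and a string `name`), and the sample values are hypothetical. It assumes the encoding rules from the Avro specification: longs are zig-zag varints, strings are a length followed by UTF-8 bytes, and record fields are written in schema order with no field names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: not the Avro library, just its documented
// binary rules for long and string, applied to a made-up two-field record.
public class AvroEncodingSketch {

    // Zig-zag maps small negative and positive values to small unsigned values;
    // the result is then written 7 bits at a time, low-order group first.
    public static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // high bit set: more bytes follow
            z >>>= 7;
        }
        out.write((int) z);
    }

    // A string is its UTF-8 byte length (encoded as a long) followed by the bytes.
    public static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // A record is just its fields in schema order; names live in the schema.
    public static byte[] encodeBinary(long id, String name) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLong(out, id);
        writeString(out, name);
        return out.toByteArray();
    }

    // Naive JSON rendering for comparison (no escaping; fine for this sample).
    public static String encodeJson(long id, String name) {
        return "{\"id\":" + id + ",\"name\":\"" + name + "\"}";
    }

    public static void main(String[] args) {
        byte[] binary = encodeBinary(1234L, "natty");
        String json = encodeJson(1234L, "natty");
        System.out.println("binary bytes: " + binary.length);
        System.out.println("json bytes:   " + json.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

For this sample record the binary form is 8 bytes against 26 for the JSON text, and a reader does not have to parse digits or quoted strings. A real Avro data file also carries a header with the schema and sync markers, which this sketch leaves out.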


On Thu, Dec 13, 2012 at 11:15 AM, Jonathan Natkins <[email protected]> wrote:

> Gotcha. Alright, I'll try a true MR pipeline, and see if that improves the
> situation. Thanks!
>
>
> On Thu, Dec 13, 2012 at 11:12 AM, Josh Wills <[email protected]> wrote:
>
>> Ah-- that is interesting, and almost certainly the reason why we're
>> writing JSON instead of binary Avro.
>>
>>
>> On Thu, Dec 13, 2012 at 11:08 AM, Jonathan Natkins <[email protected]> wrote:
>>
>>> It's Hadoop 2.0.0 and Avro 1.7.0. I've actually only been running MemPipelines thus
>>> far, to make sure that I've built the job correctly, so it's possible that
>>> that's the issue.
>>>
>>>
>>> On Thu, Dec 13, 2012 at 10:56 AM, Josh Wills <[email protected]> wrote:
>>>
>>>> That surprises me-- Crunch has its own AvroOutputFormat in order to use
>>>> the mapreduce.* APIs, but they delegate much of the work to things like
>>>> DatumWriters/encoders/etc. from Avro's core libraries.
>>>>
>>>> Could I get some detail on the Hadoop and Avro versions? Is it just
>>>> Hadoop 1.0.x and Avro 1.7.0?
>>>>
>>>> J
>>>>
>>>>
>>>> On Thu, Dec 13, 2012 at 10:35 AM, Jonathan Natkins 
>>>> <[email protected]> wrote:
>>>>
>>>>> Out of curiosity, is there a way to write output from a Crunch
>>>>> pipeline into an Avro-format file? It seems that if you call
>>>>> collection.write(To.avroFile(path)), you end up just writing JSON. It can
>>>>> certainly be read into an Avro object, but it seems like it would be more
>>>>> efficient to write binary data to the file, so no parsing has to happen.
>>>>>
>>>>> Have I missed an API, or is this a missing feature?
>>>>>
>>>>> Thanks,
>>>>> Natty
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Director of Data Science
>>>> Cloudera <http://www.cloudera.com>
>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>
>>>>
>>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>>
>


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
