[
https://issues.apache.org/jira/browse/CRUNCH-173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832201#comment-13832201
]
Chao Shi commented on CRUNCH-173:
---------------------------------
With the patch, my pipeline finished in 5 hours; without it, the same pipeline
used to take more than a day. This improvement is consistent with the previous
test, which ran at a much smaller scale.
bq. Chao Shi If you can add it here, it would be super-interesting to hear more
about your test case pipeline (i.e. the size of the tuples that you're writing,
etc).
We are using Crunch to build index shards for a search service. The most
time-consuming stage is building the posting lists. One shard (i.e. one
reducer) holds ~1 billion small records, each of which is an entry in a
posting list. The sort key is the term followed by the doc no; both are longs.
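As a minimal sketch of that sort order, assuming nothing about the actual
pipeline code: the class name PostingKey and its fields are hypothetical and
are not part of Crunch. It only illustrates "term first, then doc no" over two
longs, which gives 16 bytes of key payload per record before any serialization
overhead is added.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not a class from Crunch or from the pipeline.
public final class PostingKey implements Comparable<PostingKey> {
    // Two longs per key: 16 bytes of payload, excluding serialization overhead.
    public static final int PAYLOAD_BYTES = 2 * Long.BYTES;

    public final long term;
    public final long docNo;

    public PostingKey(long term, long docNo) {
        this.term = term;
        this.docNo = docNo;
    }

    // Order by term first; break ties by doc no.
    @Override
    public int compareTo(PostingKey other) {
        int byTerm = Long.compare(term, other.term);
        return byTerm != 0 ? byTerm : Long.compare(docNo, other.docNo);
    }

    @Override
    public String toString() {
        return "(" + term + ", " + docNo + ")";
    }

    public static void main(String[] args) {
        List<PostingKey> keys = new ArrayList<>(Arrays.asList(
                new PostingKey(7L, 42L),
                new PostingKey(3L, 9L),
                new PostingKey(7L, 5L)));
        Collections.sort(keys);
        System.out.println(keys); // prints [(3, 9), (7, 5), (7, 42)]
    }
}
```

With keys this small, any fixed per-record overhead in the serialized form is
large relative to the payload, which is why record size matters at ~1 billion
records per reducer.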
> Make WritableTypeFamily more compact for composite types
> --------------------------------------------------------
>
> Key: CRUNCH-173
> URL: https://issues.apache.org/jira/browse/CRUNCH-173
> Project: Crunch
> Issue Type: Bug
> Components: Core
> Reporter: Josh Wills
> Assignee: Josh Wills
> Attachments: CRUNCH-173.patch, CRUNCH-173b.patch
>
>
> I'm throwing this out as something of a strawman JIRA: it's always bugged me
> how verbose the serialization of TupleWritable et al. is compared to the
> Avro formats, so I took a crack at changing their underlying serialization to
> be more compact by doing more things in terms of BytesWritable and using the
> wrapping MapFns in order to do more of the de-serialization work. Patch is
> attached; if anyone is interested in this or has an opinion on whether or not
> this is a good idea, I'd love to hear it. The big pro is that Crunch jobs
> that have to use writables will run faster as a result; the downside is that
> it's not backwards compatible and it makes the code more complex.