[
https://issues.apache.org/jira/browse/CRUNCH-51?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459414#comment-13459414
]
Gabriel Reid commented on CRUNCH-51:
------------------------------------
I just tried running it again, and spotted the issue (it was only in the task
tracker logs, and not shown in the log on stdout) -- the problem is a
ClassCastException in the ReservoirRecordReader when attempting to use
GenericRecords based on the StringWrapper schema. It looks like the
AvroRecordReader being used in the ReservoirRecordReader isn't being properly
set up with the appropriate Avro PType.
I looked into the patch in more detail, and I get the feeling that there's an
easier way to do this, without needing the ReservoirRecordReader and related
classes at all. The approach that I was thinking of was:
- Use a DoFn to handle the reservoir sampling (instead of a separate
inputformat, source, and record reader)
- Use the same code as is used in MapsideJoin to put the reservoir-sampled data
into the distributed cache
- Rework the CrunchTotalOrderPartitioner to read in the partition sample values
in the same way as MapsideJoin does it
I think that the advantages of this approach are:
- Everything works with built-in Crunch functionality for handling writables
and avro (and maybe other serialization schemes) transparently
- We can avoid adding an extra sample method to PCollection (as there is a move
to reduce the number of methods in PCollection)
Additionally, using a DoFn for the reservoir sampling can be a lot more
efficient, as in some (probably many) cases, the values from the PCollection
will be being passed through the pipline regardless, so there is no additional
IO overhead added to the pipeline, which is the case right now with the
ReservoirFileInputFormat and friends.
One other point -- I had to remove the CrunchTotalOrderPartitionerTest class in
order to be able to build with Maven -- I think it should be included in the
integration tests (if it does need to be included at all).
What do you think? I know you've gotten a closer look at this stuff than I
have, so I might be missing some important points in my reasoning.
> PCollection#sort relies on using a single reducer for total order sorting
> -------------------------------------------------------------------------
>
> Key: CRUNCH-51
> URL: https://issues.apache.org/jira/browse/CRUNCH-51
> Project: Crunch
> Issue Type: Improvement
> Affects Versions: 0.3.0
> Reporter: Gabriel Reid
> Attachments: 0001-CRUNCH-51-Total-Order-Sort.patch, CRUNCH-51.patch,
> CRUNCH-51.patch, SortTest.java
>
>
> The total-order sorting provided by the Sort class (and therefore
> PCollection#sort) relies on using a single reducer in order to provide
> total-order sorting. This is very inefficient for large datasets, and should
> be replaced with a total order partitioner instead.
> For more information, see CRUNCH-23 (and possibly also MAPREDUCE-4574).
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira