[
https://issues.apache.org/jira/browse/CRUNCH-97?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477986#comment-13477986
]
Matthias Friedrich commented on CRUNCH-97:
------------------------------------------
Looks cool! Having parsed way too much text myself, there are a few things I'm
missing. Right now there doesn't seem to be much in the way of error and
missing-value handling (I noticed none in the test case, at least). To make this
universally applicable (which would be the goal for o.a.c.lib, as opposed to
contrib), we'd need a bit more support for dealing with crappy data.
At work we increment separate counters for each field that has an invalid value
and a different counter for records that are completely broken. This helps a
lot with monitoring data streams over time. Also, my experience with Java 5 (I
never re-measured this) was that throwing multiple exceptions per record when
dealing with crappy data significantly slows down processing, even in
situations where you'd think I/O would totally dominate. I've seen 600%
increases in runtime in pathological situations (throwing exceptions was fast
in Java 5, but creating the stack traces wasn't).
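
To illustrate the counter idea, here is a minimal sketch of parsing without
exceptions on the hot path. It is not taken from the patch: the class name,
the counter names and the "name,age" record layout are all made up for the
example, and a plain Map stands in for the Hadoop/Crunch counters a real job
would use. It needs Java 8 for Map.merge.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: parse "name,age" lines, counting bad data instead of throwing. */
class CountingLineParser {
  // Stand-in for framework counters (assumption); a real job would use Hadoop counters.
  private final Map<String, Long> counters = new HashMap<>();

  private void increment(String name) {
    counters.merge(name, 1L, Long::sum);
  }

  long getCount(String name) {
    return counters.getOrDefault(name, 0L);
  }

  /** Cheap pre-check so Integer.parseInt never has to throw for bad input. */
  private static boolean looksLikeInt(String s) {
    int start = s.startsWith("-") ? 1 : 0;
    int digits = s.length() - start;
    if (digits < 1 || digits > 9) {
      return false; // up to 9 digits always fits in an int
    }
    for (int i = start; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c < '0' || c > '9') {
        return false;
      }
    }
    return true;
  }

  /**
   * Returns the age, defaultAge if the age field is missing or invalid,
   * or null if the whole record is broken (wrong number of fields).
   */
  Integer parseAge(String line, int defaultAge) {
    String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
    if (fields.length != 2) {
      increment("BROKEN_RECORDS");
      return null;
    }
    String age = fields[1].trim();
    if (!looksLikeInt(age)) {
      increment("INVALID_AGE"); // one counter per field
      return defaultAge;
    }
    return Integer.parseInt(age);
  }
}
```

The pre-check is deliberately conservative (it rejects valid ints with ten
digits), which seems like an acceptable trade-off for a sketch; the point is
only that bad fields end up in a counter rather than in a stack trace.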
A few things from the nitpicking category: I'd move the inner classes to their
own files to make things easier to read, and maybe move the implementations to
an Extractors class (Guava style); the private stuff could then be made package
private. We could also use a package-info.java file for the javadocs. Finally,
the CRUNCH-97 marker is missing from the commit messages (you can squash all
three commits together using "rebase -i", which lets you edit the messages, too).
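
For the Guava-style layout, something along these lines is what I have in
mind. This is only a rough sketch: the Extractor interface, the method names
and the two implementations are hypothetical and not what the patch contains.
Everything is shown in one block here; in a real code base each type would
live in its own file, with Extractor and Extractors public.

```java
/** Hypothetical extractor interface; the patch's actual types may differ. */
interface Extractor<T> {
  T extract(String input);
}

/** Package-private implementation: callers only ever see Extractor<String>. */
class TrimmedStringExtractor implements Extractor<String> {
  @Override
  public String extract(String input) {
    return input.trim();
  }
}

/** Package-private implementation: anything other than "true" yields false. */
class BooleanExtractor implements Extractor<Boolean> {
  @Override
  public Boolean extract(String input) {
    return Boolean.parseBoolean(input.trim());
  }
}

/** Static factory class in the style of Guava's Lists, Maps, etc. */
final class Extractors {
  private Extractors() {
    // no instances
  }

  static Extractor<String> trimmed() {
    return new TrimmedStringExtractor();
  }

  static Extractor<Boolean> bool() {
    return new BooleanExtractor();
  }
}
```

Client code would then only touch the factory, e.g.
Extractors.bool().extract("true"), and the implementation classes can change
freely without breaking anyone.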
> Add helpers for parsing PCollection<String> instances
> -----------------------------------------------------
>
> Key: CRUNCH-97
> URL: https://issues.apache.org/jira/browse/CRUNCH-97
> Project: Crunch
> Issue Type: Improvement
> Components: Core
> Reporter: Josh Wills
> Assignee: Josh Wills
> Fix For: 0.4.0
>
> Attachments: CRUNCH-97.patch, CRUNCH-97-take2.patch
>
>
> We should make it a bit easier to parse delimited text files into specific
> data types (e.g., ints, floats, etc.) or combinations of types-- e.g., pairs
> of strings and ints, a Tuple3 of booleans, etc.