[ 
https://issues.apache.org/jira/browse/CRUNCH-97?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477986#comment-13477986
 ] 

Matthias Friedrich commented on CRUNCH-97:
------------------------------------------

Looks cool! Having parsed way too much text myself, there are a few things I'm 
missing. Right now there doesn't seem to be much in the way of error and 
missing-value handling (I noticed none in the test case, at least). To make this 
universally applicable (which would be the goal for o.a.c.lib, as opposed to 
contrib), we'd need a bit more support for dealing with crappy data.

At work we increment a separate counter for each field that has an invalid value 
and a different counter for records that are completely broken. This helps a 
lot with monitoring data streams over time. Also, my experience with Java 5 (I 
never re-measured this) was that throwing multiple exceptions per record when 
dealing with crappy data significantly slows down processing, even in 
situations where you'd think I/O should totally dominate. I've seen 600% 
increases in runtime in pathological situations (throwing exceptions was fast 
in Java 5, but creating the stack traces wasn't).
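
To make the counter idea concrete, here is a rough sketch of what I mean. It 
is not taken from the patch: the class name, the counter names and the 
in-memory map are all made up for illustration, and in a real job the map 
would be replaced by whatever counter mechanism the framework provides. The 
point is that fields are validated by hand instead of catching 
NumberFormatException, so bad data never triggers stack trace creation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Crunch: parses delimited records of ints
// and counts bad fields and broken records without throwing exceptions.
final class LenientRecordParser {
    private final Map<String, Long> counters = new HashMap<>();
    private final String[] fieldNames;
    private final Pattern splitter;

    LenientRecordParser(String delimiter, String... fieldNames) {
        this.splitter = Pattern.compile(Pattern.quote(delimiter));
        this.fieldNames = fieldNames;
    }

    // Returns one Integer per field, with null marking an invalid value.
    // Returns null if the record has the wrong number of fields.
    Integer[] parse(String line) {
        String[] parts = splitter.split(line, -1);
        if (parts.length != fieldNames.length) {
            increment("BROKEN_RECORDS");
            return null;
        }
        Integer[] values = new Integer[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = tryParseInt(parts[i].trim());
            if (values[i] == null) {
                increment("INVALID_" + fieldNames[i]);
            }
        }
        return values;
    }

    // Exception-free replacement for Integer.parseInt: returns null on
    // empty input, non-digit characters or values outside the int range.
    static Integer tryParseInt(String s) {
        if (s == null || s.isEmpty()) {
            return null;
        }
        boolean negative = s.charAt(0) == '-';
        int start = (negative || s.charAt(0) == '+') ? 1 : 0;
        int digits = s.length() - start;
        if (digits == 0 || digits > 10) {
            return null;
        }
        long value = 0;
        for (int i = start; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return null;
            }
            value = value * 10 + (c - '0');
        }
        if (negative) {
            value = -value;
        }
        if (value < Integer.MIN_VALUE || value > Integer.MAX_VALUE) {
            return null;
        }
        return (int) value;
    }

    long count(String name) {
        return counters.getOrDefault(name, 0L);
    }

    private void increment(String name) {
        counters.merge(name, 1L, Long::sum);
    }
}
```

With field names "id" and "age", a line like "8,abc" would bump INVALID_age 
once and still return the valid id, while a line with the wrong number of 
fields only bumps BROKEN_RECORDS.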

A few things from the nitpicking category: I'd move the inner classes to their 
own files to make things easier to read, and maybe move the implementations to 
an Extractors class (Guava style); the private stuff could be made 
package-private. We could also use a package-info.java file for the javadocs. 
Finally, the CRUNCH-97 marker is missing from the commit messages (you can 
squash all three commits together using "git rebase -i", which lets you edit 
the messages, too).

                
> Add helpers for parsing PCollection<String> instances
> -----------------------------------------------------
>
>                 Key: CRUNCH-97
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-97
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>             Fix For: 0.4.0
>
>         Attachments: CRUNCH-97.patch, CRUNCH-97-take2.patch
>
>
> We should make it a bit easier to parse delimited text files into specific 
> data types (e.g., ints, floats, etc.) or combinations of types-- e.g., pairs 
> of strings and ints, a Tuple3 of booleans, etc.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
