Ben Roling created CRUNCH-690:
---------------------------------

             Summary: materialize() directly from Source skips compression and 
glob handling
                 Key: CRUNCH-690
                 URL: https://issues.apache.org/jira/browse/CRUNCH-690
             Project: Crunch
          Issue Type: Bug
          Components: Core
    Affects Versions: 0.14.0
            Reporter: Ben Roling
            Assignee: Josh Wills


I recently came to notice some oddities that can occur with PCollections 
materialized without any DoFns performed.

For example:
{code}
PCollection<String> data = pipeline.read(From.textFile("/data/*.txt"));
Iterable<String> materializedData = data.materialize();
{code}

Results in:
{noformat}
org.apache.crunch.CrunchRuntimeException: java.io.IOException: No files found 
to materialize at: /data/*.txt
{noformat}

The globbing in the input path is not evaluated.

Another issue is that decompression is not handled.  For example, suppose I do:
{noformat}
PCollection<String> data = pipeline.read(From.textFile("/data/file1.txt.gz"));
Iterable<String> materializedData = data.materialize();
{noformat}

This will succeed, but the Strings in {{materializedData}} will be garbled junk 
as the data was never decompressed.

If I run {{parallelDo(IdentityFn.getInstance())}} on the data before 
materializing, both problems go away.  Globbing and decompression are both 
handled.

The code path to read data materialized directly is different than the code 
path for data that passes through a DoFn.  When the data passes through a DoFn, 
all the magic in FileInputFormat happens.  If it doesn't pass through a DoFn, 
that is skipped.

The latter problem is probably the worse of the two.  Potentially there could 
be a pipeline that succeeds but silently processes data incorrectly.

I only bumped into this with some perhaps weird stuff I was trying in a test 
and I have the workaround of running IdentityFn so a fix is not urgent for me 
but I wonder about possibly patching in something to avoid this silent problem. 
 I'm not going to jump on it immediately but rather at least leave this issue 
out here for visibility.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

Reply via email to