Yes, that is correct.

*From: *Allie Chen <yifangc...@google.com>
*Date: *Fri, May 10, 2019 at 4:21 PM
*To: * <dev@beam.apache.org>
*Cc: * <u...@beam.apache.org>

Yes.
>
> *From: *Lukasz Cwik <lc...@google.com>
> *Date: *Fri, May 10, 2019 at 4:19 PM
> *To: *dev
> *Cc: * <u...@beam.apache.org>
>
> When you had X gzip files and were not using Reshuffle, did you see X
>> workers read and process the files?
>>
>> On Fri, May 10, 2019 at 1:17 PM Allie Chen <yifangc...@google.com> wrote:
>>
>>> Yes, I do see the data after the reshuffle being processed in parallel.
>>> But the Reshuffle transform itself takes hours or even days to run,
>>> according to one test I did (24 gzip files, 17 million lines in total).
>>>
>>> Our users' files are mostly in gzip format, since uncompressed files
>>> would be too costly to store (they could run to hundreds of GB).
>>>
>>> Thanks,
>>>
>>> Allie
>>>
>>>
>>> *From: *Lukasz Cwik <lc...@google.com>
>>> *Date: *Fri, May 10, 2019 at 4:07 PM
>>> *To: *dev, <u...@beam.apache.org>
>>>
>>> +u...@beam.apache.org <u...@beam.apache.org>
>>>>
>>>> Reshuffle on Google Cloud Dataflow for a bounded pipeline waits until
>>>> all the data has been read before the next transforms can run. After the
>>>> reshuffle, the data should be processed in parallel across the workers.
>>>> Did you see this?
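>>>>
>>>> For reference, Reshuffle roughly pairs every element with a random key,
>>>> does a GroupByKey (which is why, for a bounded input on Dataflow,
>>>> everything upstream has to finish before anything downstream starts),
>>>> and then drops the keys again. A simplified sketch, not the exact SDK
>>>> implementation:
>>>>
>>>>     import random
>>>>
>>>>     import apache_beam as beam
>>>>
>>>>     class ReshuffleSketch(beam.PTransform):
>>>>         def expand(self, pcoll):
>>>>             return (pcoll
>>>>                     # Pair each element with a random key.
>>>>                     | 'AddKey' >> beam.Map(
>>>>                         lambda x: (random.getrandbits(32), x))
>>>>                     # GroupByKey forces a full shuffle barrier.
>>>>                     | 'Group' >> beam.GroupByKey()
>>>>                     # Drop the keys and re-emit the elements.
>>>>                     | 'DropKey' >> beam.FlatMap(lambda kv: kv[1]))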
>>>>
>>>> Are you able to change the input of your pipeline to use an
>>>> uncompressed file or many compressed files?
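>>>>
>>>> For example, with a glob over many smaller .gz files each file can be
>>>> read by a different worker. A minimal sketch (the bucket path, step
>>>> names, and process_line are placeholders):
>>>>
>>>>     import apache_beam as beam
>>>>     from apache_beam.io import ReadFromText
>>>>     from apache_beam.io.filesystem import CompressionTypes
>>>>
>>>>     def process_line(line):
>>>>         # Placeholder for the real per-line processing.
>>>>         return line
>>>>
>>>>     with beam.Pipeline() as p:
>>>>         (p
>>>>          # Each matched file can be assigned to a different worker,
>>>>          # instead of one large file pinned to a single reader.
>>>>          | 'Read' >> ReadFromText(
>>>>              'gs://my-bucket/input/shard-*.gz',
>>>>              compression_type=CompressionTypes.GZIP)
>>>>          | 'Process' >> beam.Map(process_line))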
>>>>
>>>> On Fri, May 10, 2019 at 1:03 PM Allie Chen <yifangc...@google.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>> I am trying to load a gzip file into BigQuery using Dataflow. Since the
>>>>> compressed file is not splittable, only one worker is allocated to read
>>>>> it, and that same worker ends up running all the other transforms because
>>>>> Dataflow fuses the transforms together. There is a large amount of data
>>>>> in the file, and I expected to see more workers spinning up after the
>>>>> read transform. I tried using the Reshuffle transform
>>>>> <https://github.com/apache/beam/blob/release-2.3.0/sdks/python/apache_beam/transforms/util.py#L516>
>>>>> to prevent the fusion, but it is not scalable since it won’t proceed
>>>>> until all the data has arrived at that point.
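>>>>>
>>>>> The pipeline is roughly shaped like the sketch below (the bucket path,
>>>>> table, and helper names are placeholders, not the real code):
>>>>>
>>>>>     import apache_beam as beam
>>>>>     from apache_beam.io import ReadFromText
>>>>>
>>>>>     def to_bq_row(line):
>>>>>         # Placeholder: parse a line into a BigQuery row dict.
>>>>>         return {'line': line}
>>>>>
>>>>>     with beam.Pipeline() as p:
>>>>>         (p
>>>>>          # Single gzip file: not splittable, so one worker reads it all.
>>>>>          | 'Read' >> ReadFromText('gs://my-bucket/input.csv.gz')
>>>>>          # Meant to break fusion so later steps can use many workers,
>>>>>          # but it blocks until the whole read has finished.
>>>>>          | 'BreakFusion' >> beam.Reshuffle()
>>>>>          | 'ToRow' >> beam.Map(to_bq_row)
>>>>>          | 'Write' >> beam.io.WriteToBigQuery(
>>>>>              'project:dataset.table', schema='line:STRING'))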
>>>>>
>>>>> Is there any other way to allow more workers to work on all the other
>>>>> transforms after the read?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Allie
>>>>>
>>>>>
