Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Sajeevan Achuthan
Thanks Cham

On 16 March 2018 at 23:28, Chamikara Jayalath  wrote:

> Actually, I could assign it to you.
>
> On Fri, Mar 16, 2018 at 4:27 PM Chamikara Jayalath 
> wrote:
>
>> Of course. Feel free to add a comment to JIRA and send out a pull request
>> for this.
>> Can one of the JIRA admins assign this to Sajeevan ?
>>
>> Thanks,
>> Cham
>>
>> On Fri, Mar 16, 2018 at 4:22 PM Sajeevan Achuthan <
>> achuthan.sajee...@gmail.com> wrote:
>>
>>> Hi Guys,
>>>
>>> Can I take a look at this issue? If you agree, my Jira id is eachsaj
>>>
>>> thanks
>>> Saj
>>>
>>>
>>>
>>> On 16 March 2018 at 22:13, Chamikara Jayalath 
>>> wrote:
>>>
 Created https://issues.apache.org/jira/browse/BEAM-3867.

 Thanks,
 Cham

 On Fri, Mar 16, 2018 at 3:00 PM Eugene Kirpichov 
 wrote:

> Reading can not be parallelized, but processing can be - so there is
> value in having our file-based sources automatically decompress .tar and
> .tar.gz.
> (also, I suspect that many people use Beam even for cases with a
> modest amount of data, that don't have or need parallelism, just for the
> sake of convenience of Beam's APIs and IOs)
>
> On Fri, Mar 16, 2018 at 2:50 PM Chamikara Jayalath <
> chamik...@google.com> wrote:
>
>> FWIW, if you have a concat gzip file [1] TextIO and other file-based
>> sources should be able to read that. But we don't support tar files. Is it
>> possible to perform tar extraction before running the pipeline ? This step
>> probably cannot be parallelized. So not much value in performing within the
>> pipeline anyways (other than easy access to various file-systems).
>>
>> - Cham
>>
>> [1] https://stackoverflow.com/questions/8005114/fast-concatenation-of-multiple-gzip-files
>>
>>
>> On Fri, Mar 16, 2018 at 12:26 PM Sajeevan Achuthan <
>> achuthan.sajee...@gmail.com> wrote:
>>
>>> Eugene - Yes, you are correct. I tried with a text file &  Beam
>>> wordcount example. The TextIO reader reads some illegal characters as seen
>>> below.
>>>
>>>
>>> here’s: 1
>>> addiction: 1
>>> new: 1
>>> we: 1
>>> mood: 1
>>> an: 1
>>> incredible: 1
>>> swings,: 1
>>> known: 1
>>> choices.: 1
>>> ^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
>>> eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
>>> @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
>>> @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
>>> @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
>>> @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
>>> @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
>>> @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@They’re:
>>> 1
>>> already: 2
>>> today: 1
>>> the: 3
>>> generation: 1
>>> wordcount-2
>>>
>>>
>>> thanks
>>> Saj
>>>
>>>
>>> On 16 March 2018 at 17:45, Eugene Kirpichov 
>>> wrote:
>>>
 To clarify: I think natively supporting .tar and .tar.gz would be
 quite useful. I'm just saying that currently we don't.

 On Fri, Mar 16, 2018 at 10:44 AM Eugene Kirpichov <
 kirpic...@google.com> wrote:

> The code behaves as I expected, and the output is corrupt.
> Beam unzipped the .gz, but then interpreted the .tar as a text
> file, and split the .tar file by \n.
> E.g. the first file of the output starts with lines:
> A20171012.1145+0200-1200+0200_epg10-1_node.xml/
> 75500017517513252764467016513 5ustar
> eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/
> data644000175175360513252764467017353 0ustar
> eachsajeachsaj
>
> which are clearly not the expected input.
>
> On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <
> achuthan.sajee...@gmail.com> wrote:
>
>> Eugene, I ran the code and it works fine.  I am very confident in
>> this case. I appreciate you guys for the great work.
>>
>> The code supposed to show that Beam TextIO can read the double
>> compressed files and write output without any processing. so ignored the
>> processing steps. I agree with you the further processing is not easy in
>> this case.
>>
>>
>> import org.apache.beam.sdk.Pipeline;
>> import org.apache.beam.sdk.io.TextIO;
>> import org.apache.beam.sdk.options.PipelineOptions;
>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>> import org.apache.beam.sdk.transforms.DoFn;
>> import 

Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Chamikara Jayalath
Actually, I could assign it to you.

On Fri, Mar 16, 2018 at 4:27 PM Chamikara Jayalath 
wrote:

> Of course. Feel free to add a comment to JIRA and send out a pull request
> for this.
> Can one of the JIRA admins assign this to Sajeevan ?
>
> Thanks,
> Cham
>
> On Fri, Mar 16, 2018 at 4:22 PM Sajeevan Achuthan <
> achuthan.sajee...@gmail.com> wrote:
>
>> Hi Guys,
>>
>> Can I take a look at this issue? If you agree, my Jira id is eachsaj
>>
>> thanks
>> Saj
>>
>>
>>
>> On 16 March 2018 at 22:13, Chamikara Jayalath 
>> wrote:
>>
>>> Created https://issues.apache.org/jira/browse/BEAM-3867.
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Fri, Mar 16, 2018 at 3:00 PM Eugene Kirpichov 
>>> wrote:
>>>
 Reading can not be parallelized, but processing can be - so there is
 value in having our file-based sources automatically decompress .tar and
 .tar.gz.
 (also, I suspect that many people use Beam even for cases with a modest
 amount of data, that don't have or need parallelism, just for the sake of
 convenience of Beam's APIs and IOs)

 On Fri, Mar 16, 2018 at 2:50 PM Chamikara Jayalath <
 chamik...@google.com> wrote:

> FWIW, if you have a concat gzip file [1] TextIO and other file-based
> sources should be able to read that. But we don't support tar files. Is it
> possible to perform tar extraction before running the pipeline ? This step
> probably cannot be parallelized. So not much value in performing within the
> pipeline anyways (other than easy access to various file-systems).
>
> - Cham
>
> [1]
> https://stackoverflow.com/questions/8005114/fast-concatenation-of-multiple-gzip-files
>
>
> On Fri, Mar 16, 2018 at 12:26 PM Sajeevan Achuthan <
> achuthan.sajee...@gmail.com> wrote:
>
>> Eugene - Yes, you are correct. I tried with a text file &  Beam
>> wordcount example. The TextIO reader reads some illegal characters as seen
>> below.
>>
>>
>> here’s: 1
>> addiction: 1
>> new: 1
>> we: 1
>> mood: 1
>> an: 1
>> incredible: 1
>> swings,: 1
>> known: 1
>> choices.: 1
>> ^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@They’re:
>> 1
>> already: 2
>> today: 1
>> the: 3
>> generation: 1
>> wordcount-2
>>
>>
>> thanks
>> Saj
>>
>>
>> On 16 March 2018 at 17:45, Eugene Kirpichov 
>> wrote:
>>
>>> To clarify: I think natively supporting .tar and .tar.gz would be
>>> quite useful. I'm just saying that currently we don't.
>>>
>>> On Fri, Mar 16, 2018 at 10:44 AM Eugene Kirpichov <
>>> kirpic...@google.com> wrote:
>>>
 The code behaves as I expected, and the output is corrupt.
 Beam unzipped the .gz, but then interpreted the .tar as a text
 file, and split the .tar file by \n.
 E.g. the first file of the output starts with lines:
 A20171012.1145+0200-1200+0200_epg10-1_node.xml/75500017517513252764467016513
 5ustar
 eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/data644000175175360513252764467017353
 0ustar  eachsajeachsaj

 which are clearly not the expected input.

 On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <
 achuthan.sajee...@gmail.com> wrote:

> Eugene, I ran the code and it works fine.  I am very confident in
> this case. I appreciate you guys for the great work.
>
> The code supposed to show that Beam TextIO can read the double
> compressed files and write output without any processing. so ignored the
> processing steps. I agree with you the further processing is not easy in
> this case.
>
>
> import org.apache.beam.sdk.Pipeline;
> import org.apache.beam.sdk.io.TextIO;
> import org.apache.beam.sdk.options.PipelineOptions;
> import org.apache.beam.sdk.options.PipelineOptionsFactory;
> import org.apache.beam.sdk.transforms.DoFn;
> import org.apache.beam.sdk.transforms.ParDo;
>
> public class ReadCompressedTextFile {
>
> public static void main(String[] args) {
> PipelineOptions optios =
> PipelineOptionsFactory.fromArgs(args).withValidation().create();
> Pipeline p = Pipeline.create(optios);

Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Chamikara Jayalath
Of course. Feel free to add a comment to JIRA and send out a pull request
for this.
Can one of the JIRA admins assign this to Sajeevan?

Thanks,
Cham

On Fri, Mar 16, 2018 at 4:22 PM Sajeevan Achuthan <
achuthan.sajee...@gmail.com> wrote:

> Hi Guys,
>
> Can I take a look at this issue? If you agree, my Jira id is eachsaj
>
> thanks
> Saj
>
>
>
> On 16 March 2018 at 22:13, Chamikara Jayalath 
> wrote:
>
>> Created https://issues.apache.org/jira/browse/BEAM-3867.
>>
>> Thanks,
>> Cham
>>
>> On Fri, Mar 16, 2018 at 3:00 PM Eugene Kirpichov 
>> wrote:
>>
>>> Reading can not be parallelized, but processing can be - so there is
>>> value in having our file-based sources automatically decompress .tar and
>>> .tar.gz.
>>> (also, I suspect that many people use Beam even for cases with a modest
>>> amount of data, that don't have or need parallelism, just for the sake of
>>> convenience of Beam's APIs and IOs)
>>>
>>> On Fri, Mar 16, 2018 at 2:50 PM Chamikara Jayalath 
>>> wrote:
>>>
 FWIW, if you have a concat gzip file [1] TextIO and other file-based
 sources should be able to read that. But we don't support tar files. Is it
 possible to perform tar extraction before running the pipeline ? This step
 probably cannot be parallelized. So not much value in performing within the
 pipeline anyways (other than easy access to various file-systems).

 - Cham

 [1]
 https://stackoverflow.com/questions/8005114/fast-concatenation-of-multiple-gzip-files


 On Fri, Mar 16, 2018 at 12:26 PM Sajeevan Achuthan <
 achuthan.sajee...@gmail.com> wrote:

> Eugene - Yes, you are correct. I tried with a text file &  Beam
> wordcount example. The TextIO reader reads some illegal characters as seen
> below.
>
>
> here’s: 1
> addiction: 1
> new: 1
> we: 1
> mood: 1
> an: 1
> incredible: 1
> swings,: 1
> known: 1
> choices.: 1
> ^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@They’re:
> 1
> already: 2
> today: 1
> the: 3
> generation: 1
> wordcount-2
>
>
> thanks
> Saj
>
>
> On 16 March 2018 at 17:45, Eugene Kirpichov 
> wrote:
>
>> To clarify: I think natively supporting .tar and .tar.gz would be
>> quite useful. I'm just saying that currently we don't.
>>
>> On Fri, Mar 16, 2018 at 10:44 AM Eugene Kirpichov <
>> kirpic...@google.com> wrote:
>>
>>> The code behaves as I expected, and the output is corrupt.
>>> Beam unzipped the .gz, but then interpreted the .tar as a text file,
>>> and split the .tar file by \n.
>>> E.g. the first file of the output starts with lines:
>>> A20171012.1145+0200-1200+0200_epg10-1_node.xml/75500017517513252764467016513
>>> 5ustar
>>> eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/data644000175175360513252764467017353
>>> 0ustar  eachsajeachsaj
>>>
>>> which are clearly not the expected input.
>>>
>>> On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <
>>> achuthan.sajee...@gmail.com> wrote:
>>>
 Eugene, I ran the code and it works fine.  I am very confident in
 this case. I appreciate you guys for the great work.

 The code supposed to show that Beam TextIO can read the double
 compressed files and write output without any processing. so ignored the
 processing steps. I agree with you the further processing is not easy in
 this case.


 import org.apache.beam.sdk.Pipeline;
 import org.apache.beam.sdk.io.TextIO;
 import org.apache.beam.sdk.options.PipelineOptions;
 import org.apache.beam.sdk.options.PipelineOptionsFactory;
 import org.apache.beam.sdk.transforms.DoFn;
 import org.apache.beam.sdk.transforms.ParDo;

 public class ReadCompressedTextFile {

 public static void main(String[] args) {
 PipelineOptions optios =
 PipelineOptionsFactory.fromArgs(args).withValidation().create();
 Pipeline p = Pipeline.create(optios);

 p.apply("ReadLines",
 TextIO.read().from("./dataset.tar.gz")

 ).apply(ParDo.of(new DoFn(){
 @ProcessElement
 public void processElement(ProcessContext c) {
 

Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Sajeevan Achuthan
Hi Guys,

Can I take a look at this issue? If you agree, my Jira id is eachsaj

thanks
Saj



On 16 March 2018 at 22:13, Chamikara Jayalath  wrote:

> Created https://issues.apache.org/jira/browse/BEAM-3867.
>
> Thanks,
> Cham
>
> On Fri, Mar 16, 2018 at 3:00 PM Eugene Kirpichov 
> wrote:
>
>> Reading can not be parallelized, but processing can be - so there is
>> value in having our file-based sources automatically decompress .tar and
>> .tar.gz.
>> (also, I suspect that many people use Beam even for cases with a modest
>> amount of data, that don't have or need parallelism, just for the sake of
>> convenience of Beam's APIs and IOs)
>>
>> On Fri, Mar 16, 2018 at 2:50 PM Chamikara Jayalath 
>> wrote:
>>
>>> FWIW, if you have a concat gzip file [1] TextIO and other file-based
>>> sources should be able to read that. But we don't support tar files. Is it
>>> possible to perform tar extraction before running the pipeline ? This step
>>> probably cannot be parallelized. So not much value in performing within the
>>> pipeline anyways (other than easy access to various file-systems).
>>>
>>> - Cham
>>>
>>> [1] https://stackoverflow.com/questions/8005114/fast-concatenation-of-multiple-gzip-files
>>>
>>>
>>> On Fri, Mar 16, 2018 at 12:26 PM Sajeevan Achuthan <
>>> achuthan.sajee...@gmail.com> wrote:
>>>
 Eugene - Yes, you are correct. I tried with a text file &  Beam
 wordcount example. The TextIO reader reads some illegal characters as seen
 below.


 here’s: 1
 addiction: 1
 new: 1
 we: 1
 mood: 1
 an: 1
 incredible: 1
 swings,: 1
 known: 1
 choices.: 1
 ^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
 eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
 @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
 @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
 @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
 @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
 @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
 @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@They’re:
 1
 already: 2
 today: 1
 the: 3
 generation: 1
 wordcount-2


 thanks
 Saj


 On 16 March 2018 at 17:45, Eugene Kirpichov 
 wrote:

> To clarify: I think natively supporting .tar and .tar.gz would be
> quite useful. I'm just saying that currently we don't.
>
> On Fri, Mar 16, 2018 at 10:44 AM Eugene Kirpichov <
> kirpic...@google.com> wrote:
>
>> The code behaves as I expected, and the output is corrupt.
>> Beam unzipped the .gz, but then interpreted the .tar as a text file,
>> and split the .tar file by \n.
>> E.g. the first file of the output starts with lines:
>> A20171012.1145+0200-1200+0200_epg10-1_node.xml/
>> 75500017517513252764467016513 5ustar
>> eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/
>> data644000175175360513252764467017353 0ustar
>> eachsajeachsaj
>>
>> which are clearly not the expected input.
>>
>> On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <
>> achuthan.sajee...@gmail.com> wrote:
>>
>>> Eugene, I ran the code and it works fine.  I am very confident in
>>> this case. I appreciate you guys for the great work.
>>>
>>> The code supposed to show that Beam TextIO can read the double
>>> compressed files and write output without any processing. so ignored the
>>> processing steps. I agree with you the further processing is not easy in
>>> this case.
>>>
>>>
>>> import org.apache.beam.sdk.Pipeline;
>>> import org.apache.beam.sdk.io.TextIO;
>>> import org.apache.beam.sdk.options.PipelineOptions;
>>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>> import org.apache.beam.sdk.transforms.DoFn;
>>> import org.apache.beam.sdk.transforms.ParDo;
>>>
>>> public class ReadCompressedTextFile {
>>>
>>> public static void main(String[] args) {
>>> PipelineOptions optios = PipelineOptionsFactory.
>>> fromArgs(args).withValidation().create();
>>> Pipeline p = Pipeline.create(optios);
>>>
>>> p.apply("ReadLines",
>>> TextIO.read().from("./dataset.tar.gz")
>>>
>>> ).apply(ParDo.of(new DoFn(){
>>> @ProcessElement
>>> public void processElement(ProcessContext c) {
>>> c.output(c.element());
>>> // Just write the all content to "/tmp/filout/outputfile"
>>> }
>>>
>>> }))
>>>
>>>.apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>
>>> p.run().waitUntilFinish();
>>> }
>>>
>>> }
>>>
>>> The full code, data file & output contents are 

Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Chamikara Jayalath
Created https://issues.apache.org/jira/browse/BEAM-3867.

Thanks,
Cham

On Fri, Mar 16, 2018 at 3:00 PM Eugene Kirpichov 
wrote:

> Reading can not be parallelized, but processing can be - so there is value
> in having our file-based sources automatically decompress .tar and .tar.gz.
> (also, I suspect that many people use Beam even for cases with a modest
> amount of data, that don't have or need parallelism, just for the sake of
> convenience of Beam's APIs and IOs)
>
> On Fri, Mar 16, 2018 at 2:50 PM Chamikara Jayalath 
> wrote:
>
>> FWIW, if you have a concat gzip file [1] TextIO and other file-based
>> sources should be able to read that. But we don't support tar files. Is it
>> possible to perform tar extraction before running the pipeline ? This step
>> probably cannot be parallelized. So not much value in performing within the
>> pipeline anyways (other than easy access to various file-systems).
>>
>> - Cham
>>
>> [1]
>> https://stackoverflow.com/questions/8005114/fast-concatenation-of-multiple-gzip-files
>>
>>
>> On Fri, Mar 16, 2018 at 12:26 PM Sajeevan Achuthan <
>> achuthan.sajee...@gmail.com> wrote:
>>
>>> Eugene - Yes, you are correct. I tried with a text file &  Beam
>>> wordcount example. The TextIO reader reads some illegal characters as seen
>>> below.
>>>
>>>
>>> here’s: 1
>>> addiction: 1
>>> new: 1
>>> we: 1
>>> mood: 1
>>> an: 1
>>> incredible: 1
>>> swings,: 1
>>> known: 1
>>> choices.: 1
>>> ^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@They’re:
>>> 1
>>> already: 2
>>> today: 1
>>> the: 3
>>> generation: 1
>>> wordcount-2
>>>
>>>
>>> thanks
>>> Saj
>>>
>>>
>>> On 16 March 2018 at 17:45, Eugene Kirpichov 
>>> wrote:
>>>
 To clarify: I think natively supporting .tar and .tar.gz would be quite
 useful. I'm just saying that currently we don't.

 On Fri, Mar 16, 2018 at 10:44 AM Eugene Kirpichov 
 wrote:

> The code behaves as I expected, and the output is corrupt.
> Beam unzipped the .gz, but then interpreted the .tar as a text file,
> and split the .tar file by \n.
> E.g. the first file of the output starts with lines:
> A20171012.1145+0200-1200+0200_epg10-1_node.xml/75500017517513252764467016513
> 5ustar
> eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/data644000175175360513252764467017353
> 0ustar  eachsajeachsaj
>
> which are clearly not the expected input.
>
> On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <
> achuthan.sajee...@gmail.com> wrote:
>
>> Eugene, I ran the code and it works fine.  I am very confident in
>> this case. I appreciate you guys for the great work.
>>
>> The code supposed to show that Beam TextIO can read the double
>> compressed files and write output without any processing. so ignored the
>> processing steps. I agree with you the further processing is not easy in
>> this case.
>>
>>
>> import org.apache.beam.sdk.Pipeline;
>> import org.apache.beam.sdk.io.TextIO;
>> import org.apache.beam.sdk.options.PipelineOptions;
>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>> import org.apache.beam.sdk.transforms.DoFn;
>> import org.apache.beam.sdk.transforms.ParDo;
>>
>> public class ReadCompressedTextFile {
>>
>> public static void main(String[] args) {
>> PipelineOptions optios =
>> PipelineOptionsFactory.fromArgs(args).withValidation().create();
>> Pipeline p = Pipeline.create(optios);
>>
>> p.apply("ReadLines",
>> TextIO.read().from("./dataset.tar.gz")
>>
>> ).apply(ParDo.of(new DoFn(){
>> @ProcessElement
>> public void processElement(ProcessContext c) {
>> c.output(c.element());
>> // Just write the all content to "/tmp/filout/outputfile"
>> }
>>
>> }))
>>
>>.apply(TextIO.write().to("/tmp/filout/outputfile"));
>>
>> p.run().waitUntilFinish();
>> }
>>
>> }
>>
>> The full code, data file & output contents are attached.
>>
>> thanks
>> Saj
>>
>>
>>
>>
>>
>> On 16 March 2018 at 16:56, Eugene Kirpichov 
>> wrote:
>>
>>> Sajeevan - I'm quite confident that TextIO can handle .gz, but can
>>> not handle properly .tar. Did you run this code? Did your test .tar.gz 
>>> file
>>> contain multiple 

Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Eugene Kirpichov
Reading cannot be parallelized, but processing can be - so there is value
in having our file-based sources automatically decompress .tar and .tar.gz.
(Also, I suspect that many people use Beam even for cases with a modest
amount of data that don't have or need parallelism, just for the sake of the
convenience of Beam's APIs and IOs.)
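
As a sketch of what that could buy today, and not anything Beam ships: the
example below matches the .tar.gz with FileIO and expands it inside a DoFn
using Apache Commons Compress, emitting one (entryName, line) pair per text
line so downstream processing can run in parallel even though reading each
archive stays sequential. It assumes commons-compress is on the classpath
and a Beam release with FileIO.readMatches(); the class name and file
pattern are illustrative.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

public class ExpandTarGz {

    // Emits one (entryName, line) pair per text line inside every regular file
    // of a .tar.gz archive.
    static class UntarFn extends DoFn<FileIO.ReadableFile, KV<String, String>> {
        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            FileIO.ReadableFile file = c.element();
            try (TarArchiveInputStream tar = new TarArchiveInputStream(
                    new GzipCompressorInputStream(Channels.newInputStream(file.open())))) {
                TarArchiveEntry entry;
                while ((entry = tar.getNextTarEntry()) != null) {
                    if (!entry.isFile()) {
                        continue; // skip directories and special entries
                    }
                    // A fresh reader per entry; the tar stream stops at the entry boundary.
                    BufferedReader reader =
                            new BufferedReader(new InputStreamReader(tar, StandardCharsets.UTF_8));
                    String line;
                    while ((line = reader.readLine()) != null) {
                        c.output(KV.of(entry.getName(), line));
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        p.apply(FileIO.match().filepattern("./dataset.tar.gz"))
                // Keep the raw bytes; the gzip layer is handled explicitly in UntarFn.
                .apply(FileIO.readMatches().withCompression(Compression.UNCOMPRESSED))
                .apply("UntarAndSplitLines", ParDo.of(new UntarFn()));
        p.run().waitUntilFinish();
    }
}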

On Fri, Mar 16, 2018 at 2:50 PM Chamikara Jayalath 
wrote:

> FWIW, if you have a concat gzip file [1] TextIO and other file-based
> sources should be able to read that. But we don't support tar files. Is it
> possible to perform tar extraction before running the pipeline ? This step
> probably cannot be parallelized. So not much value in performing within the
> pipeline anyways (other than easy access to various file-systems).
>
> - Cham
>
> [1]
> https://stackoverflow.com/questions/8005114/fast-concatenation-of-multiple-gzip-files
>
>
> On Fri, Mar 16, 2018 at 12:26 PM Sajeevan Achuthan <
> achuthan.sajee...@gmail.com> wrote:
>
>> Eugene - Yes, you are correct. I tried with a text file &  Beam wordcount
>> example. The TextIO reader reads some illegal characters as seen below.
>>
>>
>> here’s: 1
>> addiction: 1
>> new: 1
>> we: 1
>> mood: 1
>> an: 1
>> incredible: 1
>> swings,: 1
>> known: 1
>> choices.: 1
>> ^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@They’re:
>> 1
>> already: 2
>> today: 1
>> the: 3
>> generation: 1
>> wordcount-2
>>
>>
>> thanks
>> Saj
>>
>>
>> On 16 March 2018 at 17:45, Eugene Kirpichov  wrote:
>>
>>> To clarify: I think natively supporting .tar and .tar.gz would be quite
>>> useful. I'm just saying that currently we don't.
>>>
>>> On Fri, Mar 16, 2018 at 10:44 AM Eugene Kirpichov 
>>> wrote:
>>>
 The code behaves as I expected, and the output is corrupt.
 Beam unzipped the .gz, but then interpreted the .tar as a text file,
 and split the .tar file by \n.
 E.g. the first file of the output starts with lines:
 A20171012.1145+0200-1200+0200_epg10-1_node.xml/75500017517513252764467016513
 5ustar
 eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/data644000175175360513252764467017353
 0ustar  eachsajeachsaj

 which are clearly not the expected input.

 On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <
 achuthan.sajee...@gmail.com> wrote:

> Eugene, I ran the code and it works fine.  I am very confident in this
> case. I appreciate you guys for the great work.
>
> The code supposed to show that Beam TextIO can read the double
> compressed files and write output without any processing. so ignored the
> processing steps. I agree with you the further processing is not easy in
> this case.
>
>
> import org.apache.beam.sdk.Pipeline;
> import org.apache.beam.sdk.io.TextIO;
> import org.apache.beam.sdk.options.PipelineOptions;
> import org.apache.beam.sdk.options.PipelineOptionsFactory;
> import org.apache.beam.sdk.transforms.DoFn;
> import org.apache.beam.sdk.transforms.ParDo;
>
> public class ReadCompressedTextFile {
>
> public static void main(String[] args) {
> PipelineOptions optios =
> PipelineOptionsFactory.fromArgs(args).withValidation().create();
> Pipeline p = Pipeline.create(optios);
>
> p.apply("ReadLines",
> TextIO.read().from("./dataset.tar.gz")
>
> ).apply(ParDo.of(new DoFn(){
> @ProcessElement
> public void processElement(ProcessContext c) {
> c.output(c.element());
> // Just write the all content to "/tmp/filout/outputfile"
> }
>
> }))
>
>.apply(TextIO.write().to("/tmp/filout/outputfile"));
>
> p.run().waitUntilFinish();
> }
>
> }
>
> The full code, data file & output contents are attached.
>
> thanks
> Saj
>
>
>
>
>
> On 16 March 2018 at 16:56, Eugene Kirpichov 
> wrote:
>
>> Sajeevan - I'm quite confident that TextIO can handle .gz, but can
>> not handle properly .tar. Did you run this code? Did your test .tar.gz file
>> contain multiple files? Did you obtain the expected output, identical to
>> the input except for order of lines?
>> (also, the ParDo in this code doesn't do anything - it outputs its
>> input - so it can be removed)
>>
>> On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <
>> 

Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Jean-Baptiste Onofré
Gzip is supported by TextIO. However, you are right that tar is not yet supported.
Handling it would be similar, in terms of dealing with the archive entries.

Could you please create a Jira about that?

Thanks
Regards
JB
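
For reference, "dealing with entries" outside Beam looks roughly like the
sketch below with Apache Commons Compress (an assumed dependency; the class
name and file name are only examples): strip the gzip layer, then walk the
tar entries one by one.

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

public class ListTarGzEntries {
    public static void main(String[] args) throws Exception {
        try (InputStream gz = new GzipCompressorInputStream(new FileInputStream("dataset.tar.gz"));
                TarArchiveInputStream tar = new TarArchiveInputStream(gz)) {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                // A tar-aware source would expose these entries instead of raw header blocks.
                System.out.printf("%s (%d bytes)%n", entry.getName(), entry.getSize());
            }
        }
    }
}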

On 16 March 2018 at 14:50, Chamikara Jayalath wrote:
>FWIW, if you have a concat gzip file [1] TextIO and other file-based
>sources should be able to read that. But we don't support tar files. Is it
>possible to perform tar extraction before running the pipeline ? This step
>probably cannot be parallelized. So not much value in performing within the
>pipeline anyways (other than easy access to various file-systems).
>
>- Cham
>
>[1]
>https://stackoverflow.com/questions/8005114/fast-concatenation-of-multiple-gzip-files
>
>On Fri, Mar 16, 2018 at 12:26 PM Sajeevan Achuthan <
>achuthan.sajee...@gmail.com> wrote:
>
>> Eugene - Yes, you are correct. I tried with a text file & Beam wordcount
>> example. The TextIO reader reads some illegal characters as seen below.
>>
>>
>> here’s: 1
>> addiction: 1
>> new: 1
>> we: 1
>> mood: 1
>> an: 1
>> incredible: 1
>> swings,: 1
>> known: 1
>> choices.: 1
>>
>^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@They’re:
>> 1
>> already: 2
>> today: 1
>> the: 3
>> generation: 1
>> wordcount-2
>>
>>
>> thanks
>> Saj
>>
>>
>> On 16 March 2018 at 17:45, Eugene Kirpichov 
>wrote:
>>
>>> To clarify: I think natively supporting .tar and .tar.gz would be quite
>>> useful. I'm just saying that currently we don't.
>>>
>>> On Fri, Mar 16, 2018 at 10:44 AM Eugene Kirpichov
>
>>> wrote:
>>>
 The code behaves as I expected, and the output is corrupt.
 Beam unzipped the .gz, but then interpreted the .tar as a text file, and
 split the .tar file by \n.
 E.g. the first file of the output starts with lines:

>A20171012.1145+0200-1200+0200_epg10-1_node.xml/75500017517513252764467016513
 5ustar

>eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/data644000175175360513252764467017353
 0ustar  eachsajeachsaj

 which are clearly not the expected input.

 On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <
 achuthan.sajee...@gmail.com> wrote:

> Eugene, I ran the code and it works fine. I am very confident in this
> case. I appreciate you guys for the great work.
>
> The code supposed to show that Beam TextIO can read the double
> compressed files and write output without any processing. so ignored the
> processing steps. I agree with you the further processing is not easy in
> this case.
>
>
> import org.apache.beam.sdk.Pipeline;
> import org.apache.beam.sdk.io.TextIO;
> import org.apache.beam.sdk.options.PipelineOptions;
> import org.apache.beam.sdk.options.PipelineOptionsFactory;
> import org.apache.beam.sdk.transforms.DoFn;
> import org.apache.beam.sdk.transforms.ParDo;
>
> public class ReadCompressedTextFile {
>
> public static void main(String[] args) {
> PipelineOptions optios =
> PipelineOptionsFactory.fromArgs(args).withValidation().create();
> Pipeline p = Pipeline.create(optios);
>
> p.apply("ReadLines",
> TextIO.read().from("./dataset.tar.gz")
>
> ).apply(ParDo.of(new DoFn(){
> @ProcessElement
> public void processElement(ProcessContext c) {
> c.output(c.element());
> // Just write the all content to "/tmp/filout/outputfile"
> }
>
> }))
>
>.apply(TextIO.write().to("/tmp/filout/outputfile"));
>
> p.run().waitUntilFinish();
> }
>
> }
>
> The full code, data file & output contents are attached.
>
> thanks
> Saj
>
>
>
>
>
> On 16 March 2018 at 16:56, Eugene Kirpichov 
> wrote:
>
>> Sajeevan - I'm quite confident that TextIO can handle .gz, but can not
>> handle properly .tar. Did you run this code? Did your test .tar.gz file
>> contain multiple files? Did you obtain the expected output, identical to
>> the input except for order of lines?
>> (also, the ParDo in this code doesn't do anything - it outputs its
>> input - so it can be removed)
>>
>> On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <
>> achuthan.sajee...@gmail.com> wrote:
>>
>>> Hi Guys,
>>>
>>> The TextIo can handle the tar.gz type double compressed files.
>See
>>> 

Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Chamikara Jayalath
FWIW, if you have a concat gzip file [1], TextIO and other file-based
sources should be able to read that. But we don't support tar files. Is it
possible to perform tar extraction before running the pipeline? This step
probably cannot be parallelized, so there is not much value in performing it
within the pipeline anyway (other than easy access to various file systems).

- Cham

[1]
https://stackoverflow.com/questions/8005114/fast-concatenation-of-multiple-gzip-files
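
A concat gzip is just the member .gz files appended byte for byte, i.e. the
Java equivalent of "cat part-1.txt.gz part-2.txt.gz > combined.txt.gz". A
minimal sketch, with illustrative file names:

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ConcatGzip {
    public static void main(String[] args) throws IOException {
        // Appending complete .gz files yields one multi-member gzip stream,
        // which gzip-aware readers decompress end to end.
        try (OutputStream out = Files.newOutputStream(Paths.get("combined.txt.gz"),
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
            for (String member : new String[] {"part-1.txt.gz", "part-2.txt.gz"}) {
                Files.copy(Paths.get(member), out);
            }
        }
    }
}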

On Fri, Mar 16, 2018 at 12:26 PM Sajeevan Achuthan <
achuthan.sajee...@gmail.com> wrote:

> Eugene - Yes, you are correct. I tried with a text file &  Beam wordcount
> example. The TextIO reader reads some illegal characters as seen below.
>
>
> here’s: 1
> addiction: 1
> new: 1
> we: 1
> mood: 1
> an: 1
> incredible: 1
> swings,: 1
> known: 1
> choices.: 1
> ^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@They’re:
> 1
> already: 2
> today: 1
> the: 3
> generation: 1
> wordcount-2
>
>
> thanks
> Saj
>
>
> On 16 March 2018 at 17:45, Eugene Kirpichov  wrote:
>
>> To clarify: I think natively supporting .tar and .tar.gz would be quite
>> useful. I'm just saying that currently we don't.
>>
>> On Fri, Mar 16, 2018 at 10:44 AM Eugene Kirpichov 
>> wrote:
>>
>>> The code behaves as I expected, and the output is corrupt.
>>> Beam unzipped the .gz, but then interpreted the .tar as a text file, and
>>> split the .tar file by \n.
>>> E.g. the first file of the output starts with lines:
>>> A20171012.1145+0200-1200+0200_epg10-1_node.xml/75500017517513252764467016513
>>> 5ustar
>>> eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/data644000175175360513252764467017353
>>> 0ustar  eachsajeachsaj
>>>
>>> which are clearly not the expected input.
>>>
>>> On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <
>>> achuthan.sajee...@gmail.com> wrote:
>>>
 Eugene, I ran the code and it works fine.  I am very confident in this
 case. I appreciate you guys for the great work.

 The code supposed to show that Beam TextIO can read the double
 compressed files and write output without any processing. so ignored the
 processing steps. I agree with you the further processing is not easy in
 this case.


 import org.apache.beam.sdk.Pipeline;
 import org.apache.beam.sdk.io.TextIO;
 import org.apache.beam.sdk.options.PipelineOptions;
 import org.apache.beam.sdk.options.PipelineOptionsFactory;
 import org.apache.beam.sdk.transforms.DoFn;
 import org.apache.beam.sdk.transforms.ParDo;

 public class ReadCompressedTextFile {

 public static void main(String[] args) {
 PipelineOptions optios =
 PipelineOptionsFactory.fromArgs(args).withValidation().create();
 Pipeline p = Pipeline.create(optios);

 p.apply("ReadLines",
 TextIO.read().from("./dataset.tar.gz")

 ).apply(ParDo.of(new DoFn(){
 @ProcessElement
 public void processElement(ProcessContext c) {
 c.output(c.element());
 // Just write the all content to "/tmp/filout/outputfile"
 }

 }))

.apply(TextIO.write().to("/tmp/filout/outputfile"));

 p.run().waitUntilFinish();
 }

 }

 The full code, data file & output contents are attached.

 thanks
 Saj





 On 16 March 2018 at 16:56, Eugene Kirpichov 
 wrote:

> Sajeevan - I'm quite confident that TextIO can handle .gz, but can not
> handle properly .tar. Did you run this code? Did your test .tar.gz file
> contain multiple files? Did you obtain the expected output, identical to
> the input except for order of lines?
> (also, the ParDo in this code doesn't do anything - it outputs its
> input - so it can be removed)
>
> On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <
> achuthan.sajee...@gmail.com> wrote:
>
>> Hi Guys,
>>
>> The TextIo can handle the tar.gz type double compressed files. See
>> the code test code.
>>
>>  PipelineOptions optios =
>> PipelineOptionsFactory.fromArgs(args).withValidation().create();
>> Pipeline p = Pipeline.create(optios);
>>
>>* p.apply("ReadLines",  TextIO.read().from("/dataset.tar.gz"))*
>>   .apply(ParDo.of(new DoFn(){
>> @ProcessElement
>> public void processElement(ProcessContext c) {
>> 

Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Sajeevan Achuthan
Eugene - Yes, you are correct. I tried with a text file & the Beam wordcount
example. The TextIO reader reads some illegal characters, as seen below.


here’s: 1
addiction: 1
new: 1
we: 1
mood: 1
an: 1
incredible: 1
swings,: 1
known: 1
choices.: 1
^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@They’re:
1
already: 2
today: 1
the: 3
generation: 1
wordcount-2


thanks
Saj


On 16 March 2018 at 17:45, Eugene Kirpichov  wrote:

> To clarify: I think natively supporting .tar and .tar.gz would be quite
> useful. I'm just saying that currently we don't.
>
> On Fri, Mar 16, 2018 at 10:44 AM Eugene Kirpichov 
> wrote:
>
>> The code behaves as I expected, and the output is corrupt.
>> Beam unzipped the .gz, but then interpreted the .tar as a text file, and
>> split the .tar file by \n.
>> E.g. the first file of the output starts with lines:
>> A20171012.1145+0200-1200+0200_epg10-1_node.xml/
>> 75500017517513252764467016513 5ustar
>> eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/
>> data644000175175360513252764467017353 0ustar
>> eachsajeachsaj
>>
>> which are clearly not the expected input.
>>
>> On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <
>> achuthan.sajee...@gmail.com> wrote:
>>
>>> Eugene, I ran the code and it works fine.  I am very confident in this
>>> case. I appreciate you guys for the great work.
>>>
>>> The code supposed to show that Beam TextIO can read the double
>>> compressed files and write output without any processing. so ignored the
>>> processing steps. I agree with you the further processing is not easy in
>>> this case.
>>>
>>>
>>> import org.apache.beam.sdk.Pipeline;
>>> import org.apache.beam.sdk.io.TextIO;
>>> import org.apache.beam.sdk.options.PipelineOptions;
>>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>> import org.apache.beam.sdk.transforms.DoFn;
>>> import org.apache.beam.sdk.transforms.ParDo;
>>>
>>> public class ReadCompressedTextFile {
>>>
>>> public static void main(String[] args) {
>>> PipelineOptions optios = PipelineOptionsFactory.
>>> fromArgs(args).withValidation().create();
>>> Pipeline p = Pipeline.create(optios);
>>>
>>> p.apply("ReadLines",
>>> TextIO.read().from("./dataset.tar.gz")
>>>
>>> ).apply(ParDo.of(new DoFn(){
>>> @ProcessElement
>>> public void processElement(ProcessContext c) {
>>> c.output(c.element());
>>> // Just write the all content to "/tmp/filout/outputfile"
>>> }
>>>
>>> }))
>>>
>>>.apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>
>>> p.run().waitUntilFinish();
>>> }
>>>
>>> }
>>>
>>> The full code, data file & output contents are attached.
>>>
>>> thanks
>>> Saj
>>>
>>>
>>>
>>>
>>>
>>> On 16 March 2018 at 16:56, Eugene Kirpichov 
>>> wrote:
>>>
 Sajeevan - I'm quite confident that TextIO can handle .gz, but can not
 handle properly .tar. Did you run this code? Did your test .tar.gz file
 contain multiple files? Did you obtain the expected output, identical to
 the input except for order of lines?
 (also, the ParDo in this code doesn't do anything - it outputs its
 input - so it can be removed)

 On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <
 achuthan.sajee...@gmail.com> wrote:

> Hi Guys,
>
> The TextIo can handle the tar.gz type double compressed files. See the
> code test code.
>
>  PipelineOptions optios = PipelineOptionsFactory.
> fromArgs(args).withValidation().create();
> Pipeline p = Pipeline.create(optios);
>
>* p.apply("ReadLines",  TextIO.read().from("/dataset.tar.gz"))*
>   .apply(ParDo.of(new DoFn(){
> @ProcessElement
> public void processElement(ProcessContext c) {
> c.output(c.element());
> }
>
> }))
>
>.apply(TextIO.write().to("/tmp/filout/outputfile"));
>
> p.run().waitUntilFinish();
>
> Thanks
> /Saj
>
> On 16 March 2018 at 04:29, Pablo Estrada  wrote:
>
>> Hi!
>> Quick questions:
>> - which sdk are you using?
>> - is this batch or streaming?
>>
>> As JB mentioned, TextIO is able to work with compressed files that
>> contain text. Nothing currently handles the double decompression that I
>> believe you're looking for.
>> TextIO for Java is also able to"watch" a directory for new files. If
>> you're able to (outside of 

Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Eugene Kirpichov
The code behaves as I expected, and the output is corrupt.
Beam unzipped the .gz, but then interpreted the .tar as a text file, and
split the .tar file by \n.
E.g. the first file of the output starts with lines:
A20171012.1145+0200-1200+0200_epg10-1_node.xml/75500017517513252764467016513
5ustar
eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/data644000175175360513252764467017353
0ustar  eachsajeachsaj

which are clearly not the expected input.
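
Those "ustar" fragments are the 512-byte POSIX tar header being decoded as
text: the entry name, mode, uid/gid, size, mtime and checksum fields come
first, and the magic string "ustar" sits at byte offset 257. A small sketch
of checking for that magic after decompression, which is one way to tell a
tar archive from plain text (it assumes Apache Commons Compress for the gzip
layer; the class name and file name are examples):

import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

public class DetectTar {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new GzipCompressorInputStream(new FileInputStream("dataset.tar.gz"))) {
            // Read the first 512-byte header block of the decompressed stream.
            byte[] header = new byte[512];
            int read = 0;
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) {
                    break;
                }
                read += n;
            }
            // POSIX tar places the magic "ustar" at byte offset 257 of the header.
            boolean isTar = read == header.length
                    && "ustar".equals(new String(header, 257, 5, StandardCharsets.US_ASCII));
            System.out.println(isTar ? "tar archive" : "not a tar archive");
        }
    }
}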

On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <
achuthan.sajee...@gmail.com> wrote:

> Eugene, I ran the code and it works fine.  I am very confident in this
> case. I appreciate you guys for the great work.
>
> The code supposed to show that Beam TextIO can read the double compressed
> files and write output without any processing. so ignored the processing
> steps. I agree with you the further processing is not easy in this case.
>
>
> import org.apache.beam.sdk.Pipeline;
> import org.apache.beam.sdk.io.TextIO;
> import org.apache.beam.sdk.options.PipelineOptions;
> import org.apache.beam.sdk.options.PipelineOptionsFactory;
> import org.apache.beam.sdk.transforms.DoFn;
> import org.apache.beam.sdk.transforms.ParDo;
>
> public class ReadCompressedTextFile {
>
> public static void main(String[] args) {
> PipelineOptions optios =
> PipelineOptionsFactory.fromArgs(args).withValidation().create();
> Pipeline p = Pipeline.create(optios);
>
> p.apply("ReadLines",
> TextIO.read().from("./dataset.tar.gz")
>
> ).apply(ParDo.of(new DoFn(){
> @ProcessElement
> public void processElement(ProcessContext c) {
> c.output(c.element());
> // Just write the all content to "/tmp/filout/outputfile"
> }
>
> }))
>
>.apply(TextIO.write().to("/tmp/filout/outputfile"));
>
> p.run().waitUntilFinish();
> }
>
> }
>
> The full code, data file & output contents are attached.
>
> thanks
> Saj
>
>
>
>
>
> On 16 March 2018 at 16:56, Eugene Kirpichov  wrote:
>
>> Sajeevan - I'm quite confident that TextIO can handle .gz, but can not
>> handle properly .tar. Did you run this code? Did your test .tar.gz file
>> contain multiple files? Did you obtain the expected output, identical to
>> the input except for order of lines?
>> (also, the ParDo in this code doesn't do anything - it outputs its input
>> - so it can be removed)
>>
>> On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <
>> achuthan.sajee...@gmail.com> wrote:
>>
>>> Hi Guys,
>>>
>>> The TextIo can handle the tar.gz type double compressed files. See the
>>> code test code.
>>>
>>>  PipelineOptions optios =
>>> PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>> Pipeline p = Pipeline.create(optios);
>>>
>>>* p.apply("ReadLines",  TextIO.read().from("/dataset.tar.gz"))*
>>>   .apply(ParDo.of(new DoFn(){
>>> @ProcessElement
>>> public void processElement(ProcessContext c) {
>>> c.output(c.element());
>>> }
>>>
>>> }))
>>>
>>>.apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>
>>> p.run().waitUntilFinish();
>>>
>>> Thanks
>>> /Saj
>>>
>>> On 16 March 2018 at 04:29, Pablo Estrada  wrote:
>>>
 Hi!
 Quick questions:
 - which sdk are you using?
 - is this batch or streaming?

 As JB mentioned, TextIO is able to work with compressed files that
 contain text. Nothing currently handles the double decompression that I
 believe you're looking for.
 TextIO for Java is also able to"watch" a directory for new files. If
 you're able to (outside of your pipeline) decompress your first zip file
 into a directory that your pipeline is watching, you may be able to use
 that as work around. Does that sound like a good thing?
 Finally, if you want to implement a transform that does all your logic,
 well then that sounds like SplittableDoFn material; and in that case,
 someone that knows SDF better can give you guidance (or clarify if my
 suggestions are not correct).
 Best
 -P.

 On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré 
 wrote:

> Hi
>
> TextIO supports compressed file. Do you want to read files in text ?
>
> Can you detail a bit the use case ?
>
> Thanks
> Regards
> JB
>> On 15 March 2018 at 18:28, Shirish Jamthe wrote:
>>
>> Hi,
>>
>> My input is a tar.gz or .zip file which contains thousands of tar.gz
>> files and other files.
>> I would like to extract the tar.gz files from the tar.
>>
>> Is there a transform that can do that? I couldn't find one.
>> If not is it in works? Any pointers to start work on it?
>>
>> thanks
>>
> --
 Got feedback? go/pabloem-feedback
 

>>>
>>>
>


Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Sajeevan Achuthan
Eugene, I ran the code and it works fine. I am very confident in this
case. I appreciate you guys for the great work.

The code was only supposed to show that Beam TextIO can read the doubly
compressed file and write the output without any processing, so I skipped
the processing steps. I agree with you that further processing is not easy
in this case.


import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class ReadCompressedTextFile {

    public static void main(String[] args) {
        PipelineOptions options =
                PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline p = Pipeline.create(options);

        p.apply("ReadLines", TextIO.read().from("./dataset.tar.gz"))
                .apply(ParDo.of(new DoFn<String, String>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        // Pass each element through unchanged; everything is just
                        // written to "/tmp/filout/outputfile".
                        c.output(c.element());
                    }
                }))
                .apply(TextIO.write().to("/tmp/filout/outputfile"));

        p.run().waitUntilFinish();
    }
}

The full code, data file & output contents are attached.

thanks
Saj





On 16 March 2018 at 16:56, Eugene Kirpichov  wrote:

> Sajeevan - I'm quite confident that TextIO can handle .gz, but can not
> handle properly .tar. Did you run this code? Did your test .tar.gz file
> contain multiple files? Did you obtain the expected output, identical to
> the input except for order of lines?
> (also, the ParDo in this code doesn't do anything - it outputs its input -
> so it can be removed)
>
> On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <
> achuthan.sajee...@gmail.com> wrote:
>
>> Hi Guys,
>>
>> The TextIo can handle the tar.gz type double compressed files. See the
>> code test code.
>>
>>  PipelineOptions optios = PipelineOptionsFactory.
>> fromArgs(args).withValidation().create();
>> Pipeline p = Pipeline.create(optios);
>>
>>* p.apply("ReadLines",  TextIO.read().from("/dataset.tar.gz"))*
>>   .apply(ParDo.of(new DoFn(){
>> @ProcessElement
>> public void processElement(ProcessContext c) {
>> c.output(c.element());
>> }
>>
>> }))
>>
>>.apply(TextIO.write().to("/tmp/filout/outputfile"));
>>
>> p.run().waitUntilFinish();
>>
>> Thanks
>> /Saj
>>
>> On 16 March 2018 at 04:29, Pablo Estrada  wrote:
>>
>>> Hi!
>>> Quick questions:
>>> - which sdk are you using?
>>> - is this batch or streaming?
>>>
>>> As JB mentioned, TextIO is able to work with compressed files that
>>> contain text. Nothing currently handles the double decompression that I
>>> believe you're looking for.
>>> TextIO for Java is also able to"watch" a directory for new files. If
>>> you're able to (outside of your pipeline) decompress your first zip file
>>> into a directory that your pipeline is watching, you may be able to use
>>> that as work around. Does that sound like a good thing?
>>> Finally, if you want to implement a transform that does all your logic,
>>> well then that sounds like SplittableDoFn material; and in that case,
>>> someone that knows SDF better can give you guidance (or clarify if my
>>> suggestions are not correct).
>>> Best
>>> -P.
>>>
>>> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré 
>>> wrote:
>>>
 Hi

 TextIO supports compressed file. Do you want to read files in text ?

 Can you detail a bit the use case ?

 Thanks
 Regards
 JB
 On 15 March 2018 at 18:28, Shirish Jamthe wrote:
>
> Hi,
>
> My input is a tar.gz or .zip file which contains thousands of tar.gz
> files and other files.
> I would like to extract the tar.gz files from the tar.
>
> Is there a transform that can do that? I couldn't find one.
> If not is it in works? Any pointers to start work on it?
>
> thanks
>
 --
>>> Got feedback? go/pabloem-feedback
>>> 
>>>
>>
>>


dataset.tar.gz
Description: application/gzip


output.tar
Description: Unix tar archive


Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Eugene Kirpichov
Sajeevan - I'm quite confident that TextIO can handle .gz, but cannot
properly handle .tar. Did you run this code? Did your test .tar.gz file
contain multiple files? Did you obtain the expected output, identical to
the input except for order of lines?
(also, the ParDo in this code doesn't do anything - it outputs its input -
so it can be removed)
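
With the pass-through ParDo removed, the test reduces to a read followed by
a write; a minimal sketch using the same paths as the quoted code (the class
name is illustrative):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ReadThenWrite {
    public static void main(String[] args) {
        PipelineOptions options =
                PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline p = Pipeline.create(options);

        // Read, then write; the pass-through ParDo added nothing.
        p.apply("ReadLines", TextIO.read().from("/dataset.tar.gz"))
                .apply(TextIO.write().to("/tmp/filout/outputfile"));

        p.run().waitUntilFinish();
    }
}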

On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <
achuthan.sajee...@gmail.com> wrote:

> Hi Guys,
>
> The TextIo can handle the tar.gz type double compressed files. See the
> code test code.
>
>  PipelineOptions optios =
> PipelineOptionsFactory.fromArgs(args).withValidation().create();
> Pipeline p = Pipeline.create(optios);
>
>* p.apply("ReadLines",  TextIO.read().from("/dataset.tar.gz"))*
>   .apply(ParDo.of(new DoFn(){
> @ProcessElement
> public void processElement(ProcessContext c) {
> c.output(c.element());
> }
>
> }))
>
>.apply(TextIO.write().to("/tmp/filout/outputfile"));
>
> p.run().waitUntilFinish();
>
> Thanks
> /Saj
>
> On 16 March 2018 at 04:29, Pablo Estrada  wrote:
>
>> Hi!
>> Quick questions:
>> - which sdk are you using?
>> - is this batch or streaming?
>>
>> As JB mentioned, TextIO is able to work with compressed files that
>> contain text. Nothing currently handles the double decompression that I
>> believe you're looking for.
>> TextIO for Java is also able to"watch" a directory for new files. If
>> you're able to (outside of your pipeline) decompress your first zip file
>> into a directory that your pipeline is watching, you may be able to use
>> that as work around. Does that sound like a good thing?
>> Finally, if you want to implement a transform that does all your logic,
>> well then that sounds like SplittableDoFn material; and in that case,
>> someone that knows SDF better can give you guidance (or clarify if my
>> suggestions are not correct).
>> Best
>> -P.
>>
>> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi
>>>
>>> TextIO supports compressed file. Do you want to read files in text ?
>>>
>>> Can you detail a bit the use case ?
>>>
>>> Thanks
>>> Regards
>>> JB
 On 15 March 2018 at 18:28, Shirish Jamthe wrote:

 Hi,

 My input is a tar.gz or .zip file which contains thousands of tar.gz
 files and other files.
 I would like to extract the tar.gz files from the tar.

 Is there a transform that can do that? I couldn't find one.
 If not is it in works? Any pointers to start work on it?

 thanks

>>> --
>> Got feedback? go/pabloem-feedback
>> 
>>
>
>


Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Sajeevan Achuthan
Hi Guys,

TextIO can handle the tar.gz type of doubly compressed files. See the
test code below.

PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(options);

p.apply("ReadLines", TextIO.read().from("/dataset.tar.gz"))
        .apply(ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                // Pass each line through unchanged.
                c.output(c.element());
            }
        }))
        .apply(TextIO.write().to("/tmp/filout/outputfile"));

p.run().waitUntilFinish();

Thanks
/Saj
On 16 March 2018 at 04:29, Pablo Estrada  wrote:

> Hi!
> Quick questions:
> - which sdk are you using?
> - is this batch or streaming?
>
> As JB mentioned, TextIO is able to work with compressed files that contain
> text. Nothing currently handles the double decompression that I believe
> you're looking for.
> TextIO for Java is also able to"watch" a directory for new files. If
> you're able to (outside of your pipeline) decompress your first zip file
> into a directory that your pipeline is watching, you may be able to use
> that as work around. Does that sound like a good thing?
> Finally, if you want to implement a transform that does all your logic,
> well then that sounds like SplittableDoFn material; and in that case,
> someone that knows SDF better can give you guidance (or clarify if my
> suggestions are not correct).
> Best
> -P.
>
> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré 
> wrote:
>
>> Hi
>>
>> TextIO supports compressed file. Do you want to read files in text ?
>>
>> Can you detail a bit the use case ?
>>
>> Thanks
>> Regards
>> JB
>> On 15 March 2018 at 18:28, Shirish Jamthe wrote:
>>>
>>> Hi,
>>>
>>> My input is a tar.gz or .zip file which contains thousands of tar.gz
>>> files and other files.
>>> I would like to extract the tar.gz files from the tar.
>>>
>>> Is there a transform that can do that? I couldn't find one.
>>> If not is it in works? Any pointers to start work on it?
>>>
>>> thanks
>>>
>> --
> Got feedback? go/pabloem-feedback
>


Re: Looking for I/O transform to untar a tar.gz

2018-03-15 Thread Pablo Estrada
Hi!
Quick questions:
- which sdk are you using?
- is this batch or streaming?

As JB mentioned, TextIO is able to work with compressed files that contain
text. Nothing currently handles the double decompression that I believe
you're looking for.
TextIO for Java is also able to "watch" a directory for new files. If you're
able to (outside of your pipeline) decompress your first zip file into a
directory that your pipeline is watching, you may be able to use that as a
workaround. Does that sound like a good thing?
Finally, if you want to implement a transform that does all your logic,
well then that sounds like SplittableDoFn material; and in that case,
someone that knows SDF better can give you guidance (or clarify if my
suggestions are not correct).
Best
-P.
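
A minimal sketch of that watch-based workaround, assuming a Beam release
that has TextIO.Read#watchForNewFiles (the directory, poll interval and
termination condition are illustrative): something outside the pipeline
untars the archive into a directory, and the pipeline polls it for new
files.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WatchExtractedDir {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        // Poll /tmp/extracted for files that an external untar step keeps adding.
        PCollection<String> lines =
                p.apply("WatchExtracted",
                        TextIO.read()
                                .from("/tmp/extracted/*")
                                .watchForNewFiles(
                                        Duration.standardSeconds(30),
                                        Watch.Growth.afterTimeSinceNewOutput(Duration.standardHours(1))));

        // Downstream transforms consume 'lines' as an unbounded collection.
        p.run().waitUntilFinish();
    }
}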

On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré  wrote:

> Hi
>
> TextIO supports compressed file. Do you want to read files in text ?
>
> Can you detail a bit the use case ?
>
> Thanks
> Regards
> JB
>> On 15 March 2018 at 18:28, Shirish Jamthe wrote:
>>
>> Hi,
>>
>> My input is a tar.gz or .zip file which contains thousands of tar.gz
>> files and other files.
>> I would like to extract the tar.gz files from the tar.
>>
>> Is there a transform that can do that? I couldn't find one.
>> If not is it in works? Any pointers to start work on it?
>>
>> thanks
>>
> --
Got feedback? go/pabloem-feedback


Re: Looking for I/O transform to untar a tar.gz

2018-03-15 Thread Jean-Baptiste Onofré
Hi

TextIO supports compressed files. Do you want to read the files as text?

Can you detail the use case a bit?

Thanks
Regards
JB
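
A minimal sketch of the single-gzip case TextIO does handle, naming the
compression explicitly instead of relying on the ".gz" suffix (it assumes a
Beam release with org.apache.beam.sdk.io.Compression; the class name and
paths are illustrative):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.TextIO;

public class ReadGzippedText {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        // One text file, one gzip layer: decompress and read line by line.
        p.apply("ReadGz",
                TextIO.read()
                        .from("/data/logs.txt.gz")
                        .withCompression(Compression.GZIP))
                .apply(TextIO.write().to("/tmp/logs-decompressed"));

        p.run().waitUntilFinish();
    }
}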

On 15 March 2018 at 18:28, Shirish Jamthe wrote:
>Hi,
>
>My input is a tar.gz or .zip file which contains thousands of tar.gz files
>and other files.
>I would like to extract the tar.gz files from the tar.
>
>Is there a transform that can do that? I couldn't find one.
>If not is it in works? Any pointers to start work on it?
>
>thanks