Of course. Feel free to add a comment to JIRA and send out a pull request for this. Can one of the JIRA admins assign this to Sajeevan ?
Thanks, Cham On Fri, Mar 16, 2018 at 4:22 PM Sajeevan Achuthan < achuthan.sajee...@gmail.com> wrote: > Hi Guys, > > Can I take a look at this issue? If you agree, my Jira id is eachsaj > > thanks > Saj > > > > On 16 March 2018 at 22:13, Chamikara Jayalath <chamik...@google.com> > wrote: > >> Created https://issues.apache.org/jira/browse/BEAM-3867. >> >> Thanks, >> Cham >> >> On Fri, Mar 16, 2018 at 3:00 PM Eugene Kirpichov <kirpic...@google.com> >> wrote: >> >>> Reading can not be parallelized, but processing can be - so there is >>> value in having our file-based sources automatically decompress .tar and >>> .tar.gz. >>> (also, I suspect that many people use Beam even for cases with a modest >>> amount of data, that don't have or need parallelism, just for the sake of >>> convenience of Beam's APIs and IOs) >>> >>> On Fri, Mar 16, 2018 at 2:50 PM Chamikara Jayalath <chamik...@google.com> >>> wrote: >>> >>>> FWIW, if you have a concat gzip file [1] TextIO and other file-based >>>> sources should be able to read that. But we don't support tar files. Is it >>>> possible to perform tar extraction before running the pipeline ? This step >>>> probably cannot be parallelized. So not much value in performing within the >>>> pipeline anyways (other than easy access to various file-systems). >>>> >>>> - Cham >>>> >>>> [1] >>>> https://stackoverflow.com/questions/8005114/fast-concatenation-of-multiple-gzip-files >>>> >>>> >>>> On Fri, Mar 16, 2018 at 12:26 PM Sajeevan Achuthan < >>>> achuthan.sajee...@gmail.com> wrote: >>>> >>>>> Eugene - Yes, you are correct. I tried with a text file & Beam >>>>> wordcount example. The TextIO reader reads some illegal characters as seen >>>>> below. >>>>> >>>>> >>>>> here’s: 1 >>>>> addiction: 1 >>>>> new: 1 >>>>> we: 1 >>>>> mood: 1 >>>>> an: 1 >>>>> incredible: 1 >>>>> swings,: 1 >>>>> known: 1 >>>>> choices.: 1 >>>>> ^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@They’re: >>>>> 1 >>>>> already: 2 >>>>> today: 1 >>>>> the: 3 >>>>> generation: 1 >>>>> wordcount-00002 >>>>> >>>>> >>>>> thanks >>>>> Saj >>>>> >>>>> >>>>> On 16 March 2018 at 17:45, Eugene Kirpichov <kirpic...@google.com> >>>>> wrote: >>>>> >>>>>> To clarify: I think natively supporting .tar and .tar.gz would be >>>>>> quite useful. I'm just saying that currently we don't. >>>>>> >>>>>> On Fri, Mar 16, 2018 at 10:44 AM Eugene Kirpichov < >>>>>> kirpic...@google.com> wrote: >>>>>> >>>>>>> The code behaves as I expected, and the output is corrupt. >>>>>>> Beam unzipped the .gz, but then interpreted the .tar as a text file, >>>>>>> and split the .tar file by \n. >>>>>>> E.g. the first file of the output starts with lines: >>>>>>> A20171012.1145+0200-1200+0200_epg10-1_node.xml/0000755000175000017500000000000013252764467016513 >>>>>>> 5ustar >>>>>>> eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/data0000644000175000017500000000360513252764467017353 >>>>>>> 0ustar eachsajeachsaj<?xml version="1.0" encoding="UTF-8"?> >>>>>>> >>>>>>> which are clearly not the expected input. >>>>>>> >>>>>>> On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan < >>>>>>> achuthan.sajee...@gmail.com> wrote: >>>>>>> >>>>>>>> Eugene, I ran the code and it works fine. I am very confident in >>>>>>>> this case. I appreciate you guys for the great work. >>>>>>>> >>>>>>>> The code supposed to show that Beam TextIO can read the double >>>>>>>> compressed files and write output without any processing. so ignored >>>>>>>> the >>>>>>>> processing steps. I agree with you the further processing is not easy >>>>>>>> in >>>>>>>> this case. >>>>>>>> >>>>>>>> >>>>>>>> import org.apache.beam.sdk.Pipeline; >>>>>>>> import org.apache.beam.sdk.io.TextIO; >>>>>>>> import org.apache.beam.sdk.options.PipelineOptions; >>>>>>>> import org.apache.beam.sdk.options.PipelineOptionsFactory; >>>>>>>> import org.apache.beam.sdk.transforms.DoFn; >>>>>>>> import org.apache.beam.sdk.transforms.ParDo; >>>>>>>> >>>>>>>> public class ReadCompressedTextFile { >>>>>>>> >>>>>>>> public static void main(String[] args) { >>>>>>>> PipelineOptions optios = >>>>>>>> PipelineOptionsFactory.fromArgs(args).withValidation().create(); >>>>>>>> Pipeline p = Pipeline.create(optios); >>>>>>>> >>>>>>>> p.apply("ReadLines", >>>>>>>> TextIO.read().from("./dataset.tar.gz") >>>>>>>> >>>>>>>> ).apply(ParDo.of(new DoFn<String, String>(){ >>>>>>>> @ProcessElement >>>>>>>> public void processElement(ProcessContext c) { >>>>>>>> c.output(c.element()); >>>>>>>> // Just write the all content to "/tmp/filout/outputfile" >>>>>>>> } >>>>>>>> >>>>>>>> })) >>>>>>>> >>>>>>>> .apply(TextIO.write().to("/tmp/filout/outputfile")); >>>>>>>> >>>>>>>> p.run().waitUntilFinish(); >>>>>>>> } >>>>>>>> >>>>>>>> } >>>>>>>> >>>>>>>> The full code, data file & output contents are attached. >>>>>>>> >>>>>>>> thanks >>>>>>>> Saj >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 16 March 2018 at 16:56, Eugene Kirpichov <kirpic...@google.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Sajeevan - I'm quite confident that TextIO can handle .gz, but can >>>>>>>>> not handle properly .tar. Did you run this code? Did your test >>>>>>>>> .tar.gz file >>>>>>>>> contain multiple files? Did you obtain the expected output, identical >>>>>>>>> to >>>>>>>>> the input except for order of lines? >>>>>>>>> (also, the ParDo in this code doesn't do anything - it outputs its >>>>>>>>> input - so it can be removed) >>>>>>>>> >>>>>>>>> On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan < >>>>>>>>> achuthan.sajee...@gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Hi Guys, >>>>>>>>>> >>>>>>>>>> The TextIo can handle the tar.gz type double compressed files. >>>>>>>>>> See the code test code. >>>>>>>>>> >>>>>>>>>> PipelineOptions optios = >>>>>>>>>> PipelineOptionsFactory.fromArgs(args).withValidation().create(); >>>>>>>>>> Pipeline p = Pipeline.create(optios); >>>>>>>>>> >>>>>>>>>> * p.apply("ReadLines", >>>>>>>>>> TextIO.read().from("/dataset.tar.gz"))* >>>>>>>>>> .apply(ParDo.of(new DoFn<String, String>(){ >>>>>>>>>> @ProcessElement >>>>>>>>>> public void processElement(ProcessContext c) { >>>>>>>>>> c.output(c.element()); >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> })) >>>>>>>>>> >>>>>>>>>> .apply(TextIO.write().to("/tmp/filout/outputfile")); >>>>>>>>>> >>>>>>>>>> p.run().waitUntilFinish(); >>>>>>>>>> >>>>>>>>>> Thanks >>>>>>>>>> /Saj >>>>>>>>>> >>>>>>>>>> On 16 March 2018 at 04:29, Pablo Estrada <pabl...@google.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Hi! >>>>>>>>>>> Quick questions: >>>>>>>>>>> - which sdk are you using? >>>>>>>>>>> - is this batch or streaming? >>>>>>>>>>> >>>>>>>>>>> As JB mentioned, TextIO is able to work with compressed files >>>>>>>>>>> that contain text. Nothing currently handles the double >>>>>>>>>>> decompression that >>>>>>>>>>> I believe you're looking for. >>>>>>>>>>> TextIO for Java is also able to"watch" a directory for new >>>>>>>>>>> files. If you're able to (outside of your pipeline) decompress your >>>>>>>>>>> first >>>>>>>>>>> zip file into a directory that your pipeline is watching, you may >>>>>>>>>>> be able >>>>>>>>>>> to use that as work around. Does that sound like a good thing? >>>>>>>>>>> Finally, if you want to implement a transform that does all your >>>>>>>>>>> logic, well then that sounds like SplittableDoFn material; and in >>>>>>>>>>> that >>>>>>>>>>> case, someone that knows SDF better can give you guidance (or >>>>>>>>>>> clarify if my >>>>>>>>>>> suggestions are not correct). >>>>>>>>>>> Best >>>>>>>>>>> -P. >>>>>>>>>>> >>>>>>>>>>> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré < >>>>>>>>>>> j...@nanthrax.net> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi >>>>>>>>>>>> >>>>>>>>>>>> TextIO supports compressed file. Do you want to read files in >>>>>>>>>>>> text ? >>>>>>>>>>>> >>>>>>>>>>>> Can you detail a bit the use case ? >>>>>>>>>>>> >>>>>>>>>>>> Thanks >>>>>>>>>>>> Regards >>>>>>>>>>>> JB >>>>>>>>>>>> Le 15 mars 2018, à 18:28, Shirish Jamthe <sjam...@google.com> >>>>>>>>>>>> a écrit: >>>>>>>>>>>>> >>>>>>>>>>>>> Hi, >>>>>>>>>>>>> >>>>>>>>>>>>> My input is a tar.gz or .zip file which contains thousands of >>>>>>>>>>>>> tar.gz files and other files. >>>>>>>>>>>>> I would lile to extract the tar.gz files from the tar. >>>>>>>>>>>>> >>>>>>>>>>>>> Is there a transform that can do that? I couldn't find one. >>>>>>>>>>>>> If not is it in works? Any pointers to start work on it? >>>>>>>>>>>>> >>>>>>>>>>>>> thanks >>>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>> Got feedback? go/pabloem-feedback >>>>>>>>>>> <https://goto.google.com/pabloem-feedback> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>> >