Hm, I am not very familiar with POI, but if its APIs are able to take in a file descriptor, you should be able to use FileIO.match()[0] to find your files (local, or in GCS/S3/HDFS), and FileIO.readMatches()[1] to get file descriptors for these files.
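In case it helps, here is a minimal sketch of the glue between Beam's file descriptors and a library that wants a stream or a local file. FileIO.ReadableFile.open() returns a java.nio ReadableByteChannel, so the helper below takes a channel and either wraps it as an InputStream or spools it to a local temp file. ChannelBridge, asStream and copyToLocalTemp are names made up for illustration, and the code uses only the JDK, so treat it as a starting point rather than a finished recipe.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: not part of Beam or POI.
class ChannelBridge {

    /** Wraps a channel as an InputStream, for libraries that read from streams. */
    static InputStream asStream(ReadableByteChannel channel) {
        return Channels.newInputStream(channel);
    }

    /** Copies the whole channel into a new local temp file and returns its path. */
    static Path copyToLocalTemp(ReadableByteChannel channel, String suffix) throws IOException {
        Path tmp = Files.createTempFile("beam-input-", suffix);
        try (InputStream in = Channels.newInputStream(channel)) {
            // createTempFile already created the file, so it must be overwritten.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the channel a Beam ReadableFile would provide.
        byte[] data = "not a real workbook".getBytes(StandardCharsets.UTF_8);
        ReadableByteChannel channel = Channels.newChannel(new ByteArrayInputStream(data));

        Path local = copyToLocalTemp(channel, ".xlsx");
        System.out.println("copied " + Files.size(local) + " bytes to " + local);
        Files.delete(local);
    }
}
```

Inside a DoFn over FileIO.ReadableFile you would pass file.open() as the channel. The stream could then go to something like POI's WorkbookFactory.create(InputStream), and the temp-file path to code that insists on a local file. That wiring is an assumption on my side, not something tested here.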
If the POI libraries require the files to be local on your machine, you may need to use FileSystems.copy[2] to copy your files locally, and then analyze them.

Let me know if those are some useful building blocks for your pipeline,
Best
-P.

[0] https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/FileIO.html#match--
[1] https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/FileIO.html#readMatches--
[2] https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/FileSystems.html#copy-java.util.List-java.util.List-org.apache.beam.sdk.io.fs.MoveOptions...-

On Mon, Apr 15, 2019 at 6:20 PM Henrique Molina <[email protected]> wrote:

> Hi Pablo,
> Thanks for your attention.
> I am so sorry, my mistake: I wrote "Cs extension" but I meant the .csv extension!
> The example is like this one: load-csv-file-from-google-cloud-storage
> <https://kontext.tech/docs/DataAndBusinessIntelligence/p/load-csv-file-from-google-cloud-storage-to-bigquery-using-dataflow>
>
> I was thinking of using Apache POI to read each row from the sheet, passing
> CellRow rows to the next ParDo,
> something like this:
> .apply("xlsxToMap", ParDo.of(new DoFn<CellRow, Map<String, String>>() {.....
>
> I don't know if it is more elegant...
>
> If you have any ideas, let me know. They will be welcome!
>
>
> On Mon, Apr 15, 2019 at 6:01 PM Pablo Estrada <[email protected]> wrote:
>
>> Hello Henrique,
>>
>> I am not aware of existing Beam transforms specifically used for reading
>> in XLSX data. Can you share what you mean by "examples related with Cs
>> extension"?
>>
>> I am aware of some Python libraries for this sort of thing[1]. You could
>> use the FileIO transforms in the Python SDK to find each file, and then
>> write a DoFn that is able to read in data from these files. Check out this
>> unit test using FileIO to read CSV files[2].
>>
>> Let me know if that helps, or if I went in the wrong direction from what
>> you needed.
>> Best
>> -P.
>>
>> [1] https://openpyxl.readthedocs.io/en/stable/
>> [2] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio_test.py#L128-L148
>>
>> On Mon, Apr 15, 2019 at 12:47 PM Henrique Molina <[email protected]> wrote:
>>
>>> Hello
>>>
>>> I would like to use best practices from Apache Beam to read XLSX files.
>>> However, I found examples only related with Cs extension.
>>> Does anyone have a sample using ParDo to collect all columns and sheets
>>> from an Excel xlsx file?
>>> Afterwards I will load the data into Google BigQuery.
>>>
>>> Thanks & Regards
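
On the xlsxToMap idea quoted above: the per-row step can be kept independent of both Beam and POI, which makes it easy to test on its own. Below is a minimal sketch, assuming the first row of each sheet is a header and that cells have already been turned into strings (for example with POI's DataFormatter). RowMapper and rowToMap are hypothetical names, not part of either library.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: the body a DoFn producing Map<String, String> could delegate to.
class RowMapper {

    /**
     * Pairs header names with cell values by position.
     * Missing trailing cells become "", and cells beyond the header are ignored.
     */
    static Map<String, String> rowToMap(List<String> header, List<String> cells) {
        Map<String, String> out = new LinkedHashMap<>(); // keeps column order
        for (int i = 0; i < header.size(); i++) {
            String value = i < cells.size() ? cells.get(i) : "";
            out.put(header.get(i), value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList("id", "name", "city");
        List<String> row = Arrays.asList("1", "Ana"); // short row: "city" is missing
        System.out.println(rowToMap(header, row)); // prints {id=1, name=Ana, city=}
    }
}
```

Keeping the header-to-value pairing in a plain method like this means the DoFn itself only has to open the workbook and walk sheets and rows.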
