Re: [Python] Read Hadoop Sequence File?

Shannon Duncan Wed, 10 Jul 2019 07:58:39 -0700

If I wanted to go ahead and include this in a new Java pipeline, what level
of work would I be looking at to integrate it?


On Wed, Jul 3, 2019 at 3:54 AM Ismaël Mejía <ieme...@gmail.com> wrote:

> That's great. I can help whenever you need. We just need to choose its
> destination. Both the `hadoop-format` and `hadoop-file-system` modules
> are good candidates, though I would be inclined to put it in its own
> module, `sdks/java/extensions/sequencefile`, to make it easier for
> end users to discover.
>
> One thing to consider is the SeekableByteChannel adapters: we can move
> them into hadoop-common if needed and refactor the modules to share
> code. It is worth taking a look at
>
> org.apache.beam.sdk.io.hdfs.HadoopFileSystem.HadoopSeekableByteChannel#HadoopSeekableByteChannel
> to see if some of it could be useful.
>
> On Tue, Jul 2, 2019 at 11:46 PM Igor Bernstein <igorbernst...@google.com>
> wrote:
> >
> > Hi all,
> >
> > I wrote those classes with the intention of upstreaming them to Beam. I
> > can try to make some time this quarter to clean them up. I would need a bit
> > of guidance from a Beam expert on how to make them coexist with
> > HadoopFormatIO though.
> >
> >
> > On Tue, Jul 2, 2019 at 10:55 AM Solomon Duskis <sdus...@google.com> wrote:
> >>
> >> +Igor Bernstein who wrote the Cloud Bigtable Sequence File classes.
> >>
> >> Solomon Duskis | Google Cloud clients | sdus...@google.com | 914-462-0531
> >>
> >>
> >> On Tue, Jul 2, 2019 at 4:57 AM Ismaël Mejía <ieme...@gmail.com> wrote:
> >>>
> >>> (Adding dev@ and Solomon Duskis to the discussion)
> >>>
> >>> I was not aware of these, thanks for sharing, David. It would definitely
> >>> be a great addition if we could have them donated as an extension on
> >>> the Beam side. We could even evolve them in the future to be more
> >>> FileIO-like. Any chance this can happen? Maybe Solomon and his team?
> >>>
> >>>
> >>>
> >>> On Tue, Jul 2, 2019 at 9:39 AM David Morávek <d...@apache.org> wrote:
> >>> >
> >>> > Hi, you can use SequenceFileSink and SequenceFileSource from the
> >>> > Bigtable client. They work nicely with FileIO.
> >>> >
> >>> > https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSink.java
> >>> > https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java
> >>> >
> >>> > It would be really cool to move these into Beam, but it's up to the
> >>> > Googlers to decide whether they want to donate them.
> >>> >
> >>> > D.
> >>> >
> >>> > On Tue, Jul 2, 2019 at 2:07 AM Shannon Duncan <joseph.dun...@liveramp.com> wrote:
> >>> >>
> >>> >> It's not outside the realm of possibility. For now I've created an
> >>> >> intermediate step: a Hadoop job that converts the sequence files to
> >>> >> text files.
> >>> >>
> >>> >> Looking into better options.
> >>> >>
> >>> >> On Mon, Jul 1, 2019, 5:50 PM Chamikara Jayalath <chamik...@google.com> wrote:
> >>> >>>
> >>> >>> The Java SDK has a HadoopInputFormatIO with which you should be able
> >>> >>> to read sequence files:
> >>> >>> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-format/src/main/java/org/apache/beam/sdk/io/hadoop/format/HadoopFormatIO.java
> >>> >>> I don't think there's a direct alternative for this for Python.
> >>> >>>
> >>> >>> Is it possible to write to a well-known format such as Avro instead
> >>> >>> of a Hadoop-specific format? That would allow you to read from both
> >>> >>> Dataproc/Hadoop and the Beam Python SDK.
> >>> >>>
> >>> >>> Thanks,
> >>> >>> Cham
> >>> >>>
> >>> >>> On Mon, Jul 1, 2019 at 3:37 PM Shannon Duncan <joseph.dun...@liveramp.com> wrote:
> >>> >>>>
> >>> >>>> A missing source/sink is a pretty big hole when looking at
> >>> >>>> transitioning from Dataproc to Dataflow, using GCS as the storage
> >>> >>>> buffer instead of a traditional HDFS.
> >>> >>>>
> >>> >>>> From what I've been able to tell from the source code and
> >>> >>>> documentation, Java is able to do this but Python is not?
> >>> >>>>
> >>> >>>> Thanks,
> >>> >>>> Shannon
> >>> >>>>
> >>> >>>> On Mon, Jul 1, 2019 at 5:29 PM Chamikara Jayalath <chamik...@google.com> wrote:
> >>> >>>>>
> >>> >>>>> I don't think we have a source/sink for reading Hadoop sequence
> >>> >>>>> files. Your best bet currently will probably be to use the FileSystem
> >>> >>>>> abstraction to open a file from a ParDo and read directly from there
> >>> >>>>> using a library that can read sequence files.
> >>> >>>>>
> >>> >>>>> Thanks,
> >>> >>>>> Cham
> >>> >>>>>
> >>> >>>>> On Mon, Jul 1, 2019 at 8:42 AM Shannon Duncan <joseph.dun...@liveramp.com> wrote:
> >>> >>>>>>
> >>> >>>>>> I want to read a Hadoop Sequence/Map file stored on Google Cloud
> >>> >>>>>> Storage, at a path like "gs://bucket/link/SequenceFile-*", using the
> >>> >>>>>> Python SDK.
> >>> >>>>>>
> >>> >>>>>> I cannot locate any good adapters for this, and the one Hadoop
> >>> >>>>>> Filesystem reader seems to read only from an "hdfs://" URL.
> >>> >>>>>>
> >>> >>>>>> I want to use Dataflow and GCS exclusively so that we can start
> >>> >>>>>> mixing Beam pipelines in with our current Hadoop pipelines.
> >>> >>>>>>
> >>> >>>>>> Is this a feature that is supported or will be supported in the
> >>> >>>>>> future?
> >>> >>>>>> Does anyone have any good suggestions for this that are
> >>> >>>>>> performant?
> >>> >>>>>>
> >>> >>>>>> I'd also like to be able to write back out to a SequenceFile if
> >>> >>>>>> possible.
> >>> >>>>>>
> >>> >>>>>> Thanks!
> >>> >>>>>>
>
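
For the Python route suggested earlier in the thread (open the file from a
ParDo and read it with a library that understands sequence files), the first
thing any such reader has to do is parse the SequenceFile header. The sketch
below is a minimal, stdlib-only illustration of that step, not an existing
Beam or Hadoop API. It assumes a version 6 file and the header layout
described in the Hadoop SequenceFile documentation (magic "SEQ", version
byte, key and value class names, two compression flags, optional codec name,
metadata, 16-byte sync marker). The names read_vint, read_text and
parse_header are hypothetical helpers introduced here.

```python
import io
import struct


def read_vint(stream):
    """Decode one Hadoop variable-length integer (WritableUtils encoding)."""
    first = struct.unpack("b", stream.read(1))[0]  # first byte, signed
    if first >= -112:
        return first  # small values are stored in a single byte
    negative = first < -120
    # Number of big-endian payload bytes that follow the first byte.
    length = (-120 - first) if negative else (-112 - first)
    value = int.from_bytes(stream.read(length), "big")
    return ~value if negative else value


def read_text(stream):
    """Read a length-prefixed UTF-8 string (vint length, then bytes)."""
    length = read_vint(stream)
    return stream.read(length).decode("utf-8")


def parse_header(stream):
    """Parse a version 6 SequenceFile header from a binary file-like object."""
    if stream.read(3) != b"SEQ":
        raise ValueError("not a SequenceFile: bad magic bytes")
    version = stream.read(1)[0]
    if version != 6:
        # Assumption: older versions are out of scope for this sketch.
        raise ValueError("unsupported SequenceFile version: %d" % version)
    header = {
        "version": version,
        "key_class": read_text(stream),
        "value_class": read_text(stream),
        "compressed": stream.read(1) != b"\x00",
        "block_compressed": stream.read(1) != b"\x00",
    }
    # The codec class name is only present when compression is enabled.
    header["codec"] = read_text(stream) if header["compressed"] else None
    # Metadata: a 4-byte big-endian count, then that many key/value strings.
    (count,) = struct.unpack(">i", stream.read(4))
    header["metadata"] = {}
    for _ in range(count):
        key = read_text(stream)
        header["metadata"][key] = read_text(stream)
    header["sync"] = stream.read(16)  # marker that also separates records
    return header
```

In a Beam Python pipeline the stream could come from FileSystems.open inside
a DoFn, which handles "gs://" paths; that wiring, and decoding the records
that follow the header, are not shown here. Record decoding depends on the
key and value Writable classes and on the compression mode reported by the
header.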
