Hi Alex,

I don't think what you're trying to do makes sense. If you're using Scala,
then your data is already in the JVM and it is probably much easier to
write it to Parquet using the Java library. While that library depends on
Hadoop, you don't have to use it with HDFS. The Hadoop FileSystem interface
can be used to write directly to local disk or a number of other stores,
like S3. Using the Java library would allow you to write the data directly,
instead of translating to Arrow first.
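For example, something along these lines should work with parquet-avro (just a
sketch, not tested here; I'm assuming parquet-avro and avro are on your
classpath, and the schema, field names, and output path are placeholders):

  import org.apache.avro.SchemaBuilder
  import org.apache.avro.generic.{GenericRecord, GenericRecordBuilder}
  import org.apache.hadoop.fs.Path
  import org.apache.parquet.avro.AvroParquetWriter

  // Avro schema describing the records we want to write
  val schema = SchemaBuilder.record("Event").fields()
    .requiredString("id")
    .requiredLong("ts")
    .endRecord()

  // "file://" goes through Hadoop's local FileSystem, so no HDFS is involved
  val writer = AvroParquetWriter.builder[GenericRecord](
      new Path("file:///tmp/events.parquet"))
    .withSchema(schema)
    .build()

  try {
    writer.write(new GenericRecordBuilder(schema)
      .set("id", "abc")
      .set("ts", System.currentTimeMillis())
      .build())
  } finally {
    writer.close()
  }

An s3a:// path would work the same way with the right connector on the
classpath; the writer also handles repetition and definition levels for you,
so you never have to compute them yourself.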

Since you want to use Scala, the easiest way to get this support is probably
to write with Spark, which has most of what you need ready to go. If you're
using a different streaming system, you might not want to pull in both. What
are you using?
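With Spark it is only a couple of lines (again just a sketch; Event, the
master, and the output path are placeholders):

  import org.apache.spark.sql.SparkSession

  case class Event(id: String, ts: Long)

  val spark = SparkSession.builder()
    .appName("parquet-writer")
    .master("local[*]")   // or your cluster master
    .getOrCreate()
  import spark.implicits._

  // The same call writes to a local path, s3a://, or hdfs://
  // depending on the path scheme.
  Seq(Event("abc", 1L), Event("def", 2L)).toDS()
    .write
    .parquet("/tmp/events_parquet")

For a streaming source you'd use readStream/writeStream with the parquet
format instead, but the idea is the same.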

rb

On Tue, Mar 13, 2018 at 6:11 PM, ALeX Wang <ee07b...@gmail.com> wrote:

> Also, could i get a pointer to an example that writes a parquet file from an
> arrow memory buffer directly?
>
> The part i'm currently missing is how to derive the repetition level and
> definition level.
>
> Thanks,
>
> On 13 March 2018 at 17:52, ALeX Wang <ee07b...@gmail.com> wrote:
>
> > hi,
> >
> > i know this may not be the best place to ask but would like to try
> > anyway, as it is quite hard for me to find a good example of this online.
> >
> > My use case:
> >
> > i'd like to convert streaming data (using Scala) into arrow format in a
> > memory-mapped file and then have my parquet-cpp program write it out as a
> > parquet file to disk.
> >
> > my understanding is that java parquet only implements an HDFS writer,
> > which doesn't fit my use case (not using hadoop), and parquet-cpp is much
> > more succinct.
> >
> > My question:
> >
> > does my use case make sense? or is there a better way?
> >
> > Thanks,
> > --
> > Alex Wang,
> > Open vSwitch developer
> >
>
>
>
> --
> Alex Wang,
> Open vSwitch developer
>



-- 
Ryan Blue
Software Engineer
Netflix
