Re: pipeline with parquet and sql

2018-08-01 Thread Chamikara Jayalath
On Wed, Aug 1, 2018 at 1:12 AM Akanksha Sharma B <
akanksha.b.sha...@ericsson.com> wrote:

> Hi,
>
>
> Thanks. I understood the Parquet point. I will wait a couple of days on
> this topic. Even if this scenario cannot be achieved now, any design
> document or future plans in this direction would also be helpful to me.
>
>
> To summarize, since I do not understand Beam well enough, can someone
> please help me and comment on whether the following fits Beam's model and
> future direction:
>
> "Read Parquet (along with its inferred schema) into something like a
> dataframe or Beam Rows, and vice versa for writes, i.e. take Rows and
> write Parquet based on the Row schema."
>

Beam currently does not have a standard message format. A Beam pipeline
consists of PCollections and transforms (which convert PCollections into
other PCollections). You can transform the PCollection read from Parquet
using a ParDo and write the resulting PCollection back to Parquet format. I
think Schema-aware PCollections [1] might be close to what you need, but I'm
not sure whether they fulfill your exact requirement.
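
For illustration, here is a minimal sketch of that read -> ParDo -> write
shape. The Avro schema, paths, and field names are placeholders, and the
exact ParquetIO/FileIO imports and calls may differ slightly across Beam
versions:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class ParquetParDoSketch {
  public static void main(String[] args) {
    // Placeholder Avro schema; in practice this would describe your Parquet data.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"string\"},"
            + "{\"name\":\"value\",\"type\":\"long\"}]}");

    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read Parquet files into a PCollection of Avro GenericRecords.
    PCollection<GenericRecord> records =
        p.apply("ReadParquet", ParquetIO.read(schema).from("/path/to/input/*.parquet"));

    // Transform each record with a ParDo (here it simply copies the fields through).
    PCollection<GenericRecord> transformed =
        records
            .apply("Transform", ParDo.of(new DoFn<GenericRecord, GenericRecord>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                GenericRecord in = c.element();
                GenericRecord out = new GenericData.Record(in.getSchema());
                out.put("id", in.get("id"));
                out.put("value", in.get("value"));
                c.output(out);
              }
            }))
            .setCoder(AvroCoder.of(schema));

    // Write the resulting PCollection back out in Parquet format.
    transformed.apply("WriteParquet",
        FileIO.<GenericRecord>write()
            .via(ParquetIO.sink(schema))
            .to("/path/to/output/")
            .withSuffix(".parquet"));

    p.run().waitUntilFinish();
  }
}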

Thanks,
Cham

[1]
https://lists.apache.org/thread.html/fe327866c6c81b7e55af28f81cedd9b2e588279def330940e8b8ebd7@%3Cdev.beam.apache.org%3E
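
And in case it helps to see what the Beam SQL / schema side expects today,
below is a small speculative sketch using Rows with an explicit Beam Schema
in a recent Beam version. SqlTransform lives in the
beam-sdks-java-extensions-sql module, and the field names and values here
are made up:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class BeamSqlRowSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Beam SQL operates on PCollection<Row>, and each Row carries a Beam Schema.
    Schema schema = Schema.builder().addStringField("id").addInt64Field("value").build();

    PCollection<Row> rows =
        p.apply(Create.of(
                Row.withSchema(schema).addValues("a", 1L).build(),
                Row.withSchema(schema).addValues("b", 2L).build())
            .withRowSchema(schema));

    // PCOLLECTION is the table name SqlTransform assigns to a single unnamed input.
    PCollection<Row> filtered =
        rows.apply(SqlTransform.query("SELECT id, value FROM PCOLLECTION WHERE value > 1"));

    p.run().waitUntilFinish();
  }
}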


>
>
>
> Regards,
>
> Akanksha
>
>
> --
> *From:* Łukasz Gajowy 
> *Sent:* Tuesday, July 31, 2018 12:43:32 PM
> *To:* u...@beam.apache.org
> *Cc:* dev@beam.apache.org
> *Subject:* Re: pipeline with parquet and sql
>
> In terms of schema and ParquetIO source/sink, there was an answer in a
> previous thread [1]:
>
> Currently (without introducing any change in ParquetIO) there is no way to
> avoid passing the Avro schema. It will probably be replaced with Beam's
> schema in the future ()
>
> [1]
> https://lists.apache.org/thread.html/a466ddeb55e47fd780be3bcd8eec9d6b6eaf1dfd566ae5278b5fb9e8@%3Cuser.beam.apache.org%3E
>
>
> Tue, 31 Jul 2018 at 10:19, Akanksha Sharma B
> wrote:
>
> Hi,
>
>
> I am hoping to get some hints/pointers from the experts here.
>
> I hope the scenario described below is understandable and that it is a
> valid use case. Please let me know if I need to explain the scenario
> better.
>
>
> Regards,
>
> Akanksha
>
> --
> *From:* Akanksha Sharma B
> *Sent:* Friday, July 27, 2018 9:44 AM
> *To:* dev@beam.apache.org
> *Subject:* Re: pipeline with parquet and sql
>
>
> Hi,
>
>
> Please consider the following pipeline:
>
>
> The source is a Parquet file with hundreds of columns.
>
> The sink is Parquet. Multiple output Parquet files are generated after
> applying some SQL joins. The SQL joins to be applied differ for each output
> Parquet file. Let's assume we have a SQL query generator or some
> configuration file with the needed info.
>
>
> Can this be implemented generically, such that there is no need for the
> schema of the Parquet files involved, nor for any intermediate POJO or Beam
> schema?
>
> That is, the way Spark can handle it: read Parquet into a dataframe, create
> a temp view, apply SQL queries to it, and write it back to Parquet.
>
> As I understand it, Beam SQL needs a Beam Schema or POJOs, and ParquetIO
> needs Avro schemas. Ideally we don't want to deal with POJOs or schemas.
> If there is a way to achieve this with Beam, please do help.
>
> Regards,
> Akanksha
>


Re: pipeline with parquet and sql

2018-07-31 Thread Łukasz Gajowy
Sorry, I sent an unfinished message.

In terms of schema and ParquetIO source/sink, there was an answer in a
previous thread [1]. Currently (without introducing any change in
ParquetIO) there is no way to avoid passing the Avro schema. It will probably
be replaced with Beam's schema in the future [2].

[1]
https://lists.apache.org/thread.html/a466ddeb55e47fd780be3bcd8eec9d6b6eaf1dfd566ae5278b5fb9e8@%3Cuser.beam.apache.org%3E
[2] https://issues.apache.org/jira/browse/BEAM-4812
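
Concretely, with ParquetIO as it is today the Avro schema has to be supplied
explicitly on both the read and the write side. A minimal sketch, where the
schema JSON, the pipeline, and the file paths are placeholders:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.values.PCollection;

class ParquetSchemaRequirementSketch {
  // schemaJson and the file paths below are placeholders for illustration only.
  static void addParquetSteps(Pipeline pipeline, String schemaJson) {
    Schema avroSchema = new Schema.Parser().parse(schemaJson);

    // Reading requires the Avro schema up front...
    PCollection<GenericRecord> records =
        pipeline.apply(ParquetIO.read(avroSchema).from("/path/to/input/*.parquet"));

    // ...and so does writing.
    records.apply(
        FileIO.<GenericRecord>write()
            .via(ParquetIO.sink(avroSchema))
            .to("/path/to/output/"));
  }
}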


Re: pipeline with parquet and sql

2018-07-31 Thread Łukasz Gajowy
In terms of schema and ParquetIO source/sink, there was an answer in a
previous thread [1]:

Currently (without introducing any change in ParquetIO) there is no way to
avoid passing the Avro schema. It will probably be replaced with Beam's
schema in the future ()

[1]
https://lists.apache.org/thread.html/a466ddeb55e47fd780be3bcd8eec9d6b6eaf1dfd566ae5278b5fb9e8@%3Cuser.beam.apache.org%3E

