[ https://issues.apache.org/jira/browse/PARQUET-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731931#comment-17731931 ]

ASF GitHub Bot commented on PARQUET-1822:
-----------------------------------------

amousavigourabi commented on PR #1111:
URL: https://github.com/apache/parquet-mr/pull/1111#issuecomment-1588736694

   > I don't want to sound too greedy, but the next level of this feature
   > would be if the classes in question have no imports of Hadoop in them.
   > Something like: Parquet (with Hadoop) -> Parquet (with java.nio.File),
   > and the lower level classes are a jar of their own.
   > 
   > Just dreaming...
   
   One day... The next step is to remove the tight coupling to the other
Hadoop classes (mainly Configuration), since that shouldn't break anything
and would at least allow users to drop hadoop-client-runtime. But first, this
PR is about letting users avoid the bigger Hadoop issues more easily.




> Parquet without Hadoop dependencies
> -----------------------------------
>
>                 Key: PARQUET-1822
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1822
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-avro
>    Affects Versions: 1.11.0
>         Environment: Amazon Fargate (linux), Windows development box.
> We are writing Parquet to be read by the Snowflake and Athena databases.
>            Reporter: mark juchems
>            Priority: Minor
>              Labels: documentation, newbie
>
> I have been trying for weeks to create a Parquet file from Avro and write 
> it to S3 in Java.  This has been incredibly frustrating and odd, as Spark 
> can do it easily (I'm told).
> I have assembled the correct jars through luck and diligence, but now I find 
> out that I have to have Hadoop installed on my machine.  I am currently 
> developing on Windows, and it seems a DLL and an EXE can fix that up, but I 
> am wondering about Linux, as the code will eventually run in Fargate on AWS.
> *Why do I need external dependencies and not pure java?*
> The thing really is how utterly complex all this is.  I would like to create 
> an Avro file, convert it to Parquet, and write it to S3, but I am trapped 
> in "ParquetWriter" hell! 
> *Why can't I get a normal OutputStream and write it wherever I want?*
> I have scoured the web for examples and there are a few, but we really need 
> some documentation on this stuff.  I understand that there may be reasons 
> for all this, but I can't find them on the web anywhere.  Any help?  Can't 
> we get a "SimpleParquet" jar that does this:
>  
> ParquetWriter<GenericData.Record> writer =
>     AvroParquetWriter.<GenericData.Record>builder(outputStream)
>         .withSchema(avroSchema)
>         .withConf(conf)
>         .withCompressionCodec(CompressionCodecName.SNAPPY)
>         .withWriteMode(Mode.OVERWRITE) // probably not good for prod (overwrites files)
>         .build();
>  
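For what it's worth, something very close to this builder already works
without a Hadoop Path: AvroParquetWriter.builder(...) also accepts an
org.apache.parquet.io.OutputFile. Paired with an adapter like the
hypothetical NioOutputFile sketched earlier, a rough, untested version of the
wished-for snippet could look like the following (avroSchema is assumed to be
supplied by the caller; note that Hadoop's Configuration class still has to
be on the classpath, which is exactly the residual coupling discussed in the
PR comment above).

    import java.io.IOException;
    import java.nio.file.Paths;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetFileWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class SimpleParquetSketch {
      // Open a writer against a local file via the hypothetical NioOutputFile
      // adapter, with no Hadoop Path or FileSystem involved.
      static ParquetWriter<GenericData.Record> open(Schema avroSchema) throws IOException {
        return AvroParquetWriter.<GenericData.Record>builder(
                new NioOutputFile(Paths.get("out.parquet")))
            .withSchema(avroSchema)
            .withCompressionCodec(CompressionCodecName.SNAPPY)
            .withWriteMode(ParquetFileWriter.Mode.OVERWRITE) // overwrites: fine for demos
            .build();
      }
    }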



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
