[ https://issues.apache.org/jira/browse/PARQUET-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850480#comment-17850480 ]

Steve Loughran commented on PARQUET-860:
----------------------------------------

bq. But Amazon S3 is a cloud service that provides distributed object storage. 
There is no persistent connection to the object being saved, and any network 
problem or service failure can cause a transient error or timeout. You write 
that my application code is responsible for preserving the original data. Is 
this data already buffered in memory in the objects created by ParquetWriter? 
Do I need to keep a duplicate copy of the input data in the application?

The s3a filesystem puts a lot of effort into retry and recovery in close() 
because it is so critical. One thing to note is that too much code assumes 
close() is fast; it often isn't, and if the calling thread is also the one 
sending heartbeats back, those heartbeats can time out. If you set a progress 
callback on the FSDataOutputStream, we will invoke it after every queued block 
is uploaded.
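
For illustration, a minimal sketch of wiring up such a callback (the bucket 
name and heartbeat message are hypothetical; {{FileSystem.create(Path, 
Progressable)}} is a standard Hadoop overload):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.Progressable;

public class UploadWithProgress {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("s3a://my-bucket/data/file.parquet"); // hypothetical bucket
    FileSystem fs = path.getFileSystem(conf);

    // s3a invokes the Progressable as each queued block upload completes,
    // so a slow close() still emits liveness signals.
    Progressable heartbeat = () -> System.out.println("block uploaded, still alive");

    try (FSDataOutputStream out = fs.create(path, heartbeat)) {
      out.write(new byte[128 * 1024 * 1024]); // enough data to queue several blocks
    } // close() blocks until every queued upload finishes; heartbeat fires meanwhile
  }
}
{code}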

> ParquetWriter.getDataSize NullPointerException after closed
> -----------------------------------------------------------
>
>                 Key: PARQUET-860
>                 URL: https://issues.apache.org/jira/browse/PARQUET-860
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.9.0
>         Environment: Linux prim 4.8.13-1-ARCH #1 SMP PREEMPT Fri Dec 9 
> 07:24:34 CET 2016 x86_64 GNU/Linux
> openjdk version "1.8.0_112"
> OpenJDK Runtime Environment (build 1.8.0_112-b15)
> OpenJDK 64-Bit Server VM (build 25.112-b15, mixed mode)
>            Reporter: Mike Mintz
>            Priority: Major
>
> When I run {{ParquetWriter.getDataSize()}}, it works normally. But after I 
> call {{ParquetWriter.close()}}, subsequent calls to {{ParquetWriter.getDataSize()}} 
> result in a NullPointerException.
> {noformat}
> java.lang.NullPointerException
>       at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.getDataSize(InternalParquetRecordWriter.java:132)
>       at 
> org.apache.parquet.hadoop.ParquetWriter.getDataSize(ParquetWriter.java:314)
>       at FileBufferState.getFileSizeInBytes(FileBufferState.scala:83)
> {noformat}
> The reason for the NPE appears to be in 
> {{InternalParquetRecordWriter.getDataSize}}, where it assumes that 
> {{columnStore}} is not null.
> But the {{close()}} method calls {{flushRowGroupToStore()}} which sets 
> {{columnStore = null}}.
> I'm guessing that once the file is closed, we can just return 
> {{lastRowGroupEndPos}} since there should be no more buffered data, but I 
> don't fully understand how this class works.
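
For context, a minimal sketch of the guard being suggested above (an 
illustration, not a committed patch; it assumes the open-writer path computes 
the size as {{lastRowGroupEndPos}} plus the column store's buffered bytes, as 
in parquet-mr 1.9.0):

{code:java}
// InternalParquetRecordWriter.getDataSize(), with a guard for the closed case
public long getDataSize() {
  if (columnStore == null) {
    // close() ran flushRowGroupToStore(), which nulled the column store;
    // nothing is buffered any more, so everything is in lastRowGroupEndPos
    return lastRowGroupEndPos;
  }
  // open writer: bytes already flushed to row groups plus bytes still buffered
  return lastRowGroupEndPos + columnStore.getBufferedSize();
}
{code}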



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
