[ https://issues.apache.org/jira/browse/PARQUET-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370435#comment-16370435 ]

e.birukov commented on PARQUET-860:
-----------------------------------

[~rdblue], you wrote: "Most of the time, we assume that an exception in close is 
not recoverable and the entire file needs to be rewritten". That is obvious 
when writing to a local file system, where the file is opened for writing at 
the moment the ParquetWriter is created. But Amazon S3 is a cloud service that 
provides distributed object storage: there is no permanent connection to the 
object being saved, and any network problem or service failure can cause a 
temporary error or timeout. You say that my application code is responsible for 
keeping the original data. Is that data already buffered in memory by the 
objects ParquetWriter creates, or do I need to keep a duplicate copy of the 
input data in the application?
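One defensive pattern for the "rewrite the entire file" semantics described above is for the application to keep its own copy of the rows until close() succeeds, and recreate the writer from scratch on failure. This is only a sketch of that idea; RecordSink and SinkFactory are hypothetical stand-ins, not parquet-mr API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical app-side retry wrapper: the application keeps its own buffer of
// records so the whole file can be rewritten if close() fails transiently.
public class RetryingFileWriter {

    /** Minimal stand-in for a writer whose close() may fail transiently. */
    interface RecordSink {
        void write(String record) throws Exception;
        void close() throws Exception;
    }

    /** Creates a fresh sink (i.e. a fresh output file) for each attempt. */
    interface SinkFactory {
        RecordSink create() throws Exception;
    }

    /** Writes all records, recreating the sink from scratch on each attempt. */
    public static int writeWithRetry(List<String> records, SinkFactory factory,
                                     int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                RecordSink sink = factory.create();
                for (String r : records) {
                    sink.write(r);    // rows come from the app's own buffer
                }
                sink.close();         // only after close() is the file durable
                return attempt;
            } catch (Exception e) {
                last = e;             // discard partial output and retry
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        List<String> rows = new ArrayList<>();
        rows.add("a");
        rows.add("b");

        // A sink whose close() fails on the first attempt, then succeeds.
        final int[] closes = {0};
        SinkFactory flaky = () -> new RecordSink() {
            public void write(String record) {}
            public void close() throws Exception {
                if (closes[0]++ == 0) throw new Exception("transient S3 timeout");
            }
        };

        int attempts = writeWithRetry(rows, flaky, 3);
        System.out.println("succeeded on attempt " + attempts);
    }
}
```

The key point is that the retry loop owns the record buffer, so a failed close() costs a full rewrite but never loses data.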

> ParquetWriter.getDataSize NullPointerException after closed
> -----------------------------------------------------------
>
>                 Key: PARQUET-860
>                 URL: https://issues.apache.org/jira/browse/PARQUET-860
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.9.0
>         Environment: Linux prim 4.8.13-1-ARCH #1 SMP PREEMPT Fri Dec 9 
> 07:24:34 CET 2016 x86_64 GNU/Linux
> openjdk version "1.8.0_112"
> OpenJDK Runtime Environment (build 1.8.0_112-b15)
> OpenJDK 64-Bit Server VM (build 25.112-b15, mixed mode)
>            Reporter: Mike Mintz
>            Priority: Major
>
> When I run {{ParquetWriter.getDataSize()}}, it works normally. But after I 
> call {{ParquetWriter.close()}}, subsequent calls to 
> {{ParquetWriter.getDataSize}} result in a NullPointerException.
> {noformat}
> java.lang.NullPointerException
>       at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.getDataSize(InternalParquetRecordWriter.java:132)
>       at 
> org.apache.parquet.hadoop.ParquetWriter.getDataSize(ParquetWriter.java:314)
>       at FileBufferState.getFileSizeInBytes(FileBufferState.scala:83)
> {noformat}
> The reason for the NPE appears to be in 
> {{InternalParquetRecordWriter.getDataSize}}, where it assumes that 
> {{columnStore}} is not null.
> But the {{close()}} method calls {{flushRowGroupToStore()}} which sets 
> {{columnStore = null}}.
> I'm guessing that once the file is closed, we can just return 
> {{lastRowGroupEndPos}} since there should be no more buffered data, but I 
> don't fully understand how this class works.
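The guess in the quoted report can be expressed as a null check in {{getDataSize()}}. The following is a sketch against a simplified model of the class, not the real InternalParquetRecordWriter; the field and method names are taken from the stack trace and the description above:

```java
// Simplified model of InternalParquetRecordWriter illustrating the suggested
// fix: once close() has nulled out columnStore, return the position of the
// last flushed row group instead of dereferencing null.
public class RecordWriterModel {

    /** Stand-in for the column store; only its buffered size matters here. */
    static class ColumnStore {
        long getBufferedSize() { return 10; }
    }

    private ColumnStore columnStore = new ColumnStore();
    private long lastRowGroupEndPos = 0;

    public void close() {
        // flushRowGroupToStore(): record the flushed position, drop the store
        lastRowGroupEndPos += columnStore.getBufferedSize();
        columnStore = null;
    }

    public long getDataSize() {
        // Suggested fix: after close there is no buffered data left, so the
        // size is just the end of the last row group.
        if (columnStore == null) {
            return lastRowGroupEndPos;
        }
        return lastRowGroupEndPos + columnStore.getBufferedSize();
    }

    public static void main(String[] args) {
        RecordWriterModel w = new RecordWriterModel();
        System.out.println(w.getDataSize()); // before close: flushed + buffered
        w.close();
        System.out.println(w.getDataSize()); // after close: no NPE
    }
}
```

With this guard, calling getDataSize() after close() returns the final file size instead of throwing.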



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)