[
https://issues.apache.org/jira/browse/PARQUET-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155675#comment-17155675
]
Eric Gorelik commented on PARQUET-1882:
---------------------------------------
Here's a minimal reproduction:
{code:c++}
#include <arrow/io/api.h>
#include <parquet/api/writer.h>
#include <parquet/api/reader.h>
using namespace parquet;
using namespace parquet::schema;
int main()
{
  // Schema: a single optional INT32 column named "nulls".
  auto primitiveNode = PrimitiveNode::Make("nulls", Repetition::OPTIONAL,
                                           nullptr, Type::INT32);
  NodeVector columns({ primitiveNode });
  auto rootNode = GroupNode::Make("root", Repetition::REQUIRED, columns,
                                  nullptr);

  std::shared_ptr<arrow::io::OutputStream> fileOut;
  arrow::io::FileOutputStream::Open("test.parquet", &fileOut);
  auto fileWriter = ParquetFileWriter::Open(fileOut,
      std::static_pointer_cast<GroupNode>(rootNode));
  auto rowGroupWriter = fileWriter->AppendRowGroup();
  auto columnWriter = static_cast<Int32Writer*>(rowGroupWriter->NextColumn());

  // Write three nulls: definition level 0 marks a null, so the values
  // array is never read, and a 0-byte dictionary page gets written.
  int32_t values[3];
  int16_t defLevels[] = { 0, 0, 0 };
  columnWriter->WriteBatch(3, defLevels, nullptr, values);
  columnWriter->Close();
  rowGroupWriter->Close();
  fileWriter->Close();
  fileOut->Close();

  // Read the file back with buffered_stream enabled; ReadBatch aborts
  // the process in BufferedInputStream's ARROW_CHECK_GT(nbytes, 0).
  ReaderProperties props = default_reader_properties();
  props.enable_buffered_stream();
  auto fileReader = ParquetFileReader::OpenFile("test.parquet", true, props);
  auto rowGroupReader = fileReader->RowGroup(0);
  auto columnReader =
      std::static_pointer_cast<Int32Reader>(rowGroupReader->Column(0));
  int64_t valuesRead;
  columnReader->ReadBatch(3, defLevels, nullptr, values, &valuesRead);
}
{code}
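For what it's worth, a guard on the caller's side along these lines would sidestep the abort by never issuing a zero-length read. This is only a sketch of the idea, not a proposed patch: {{ReadPage}} and the vector-based "stream" here are hypothetical stand-ins, not the actual parquet-cpp/Arrow types.
{code:c++}
#include <cassert>
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical stand-in for the page-read path in
// SerializedPageReader::NextPage, with a zero-length guard added.
std::vector<uint8_t> ReadPage(int compressed_len,
                              const std::vector<uint8_t>& stream) {
  if (compressed_len == 0) {
    // A 0-byte page (e.g. the dictionary page of an all-null column):
    // return an empty buffer instead of calling Read(0), which would
    // trip ARROW_CHECK_GT(nbytes, 0) in BufferedInputStream.
    return {};
  }
  assert(compressed_len > 0);  // stand-in for ARROW_CHECK_GT
  return std::vector<uint8_t>(stream.begin(),
                              stream.begin() + compressed_len);
}

int main() {
  std::vector<uint8_t> stream{1, 2, 3};
  std::cout << ReadPage(0, stream).size() << "\n";  // empty page, no abort
  std::cout << ReadPage(2, stream).size() << "\n";
  return 0;
}
{code}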
> Writing an all-null column and then reading it with buffered_stream aborts
> the process
> --------------------------------------------------------------------------------------
>
> Key: PARQUET-1882
> URL: https://issues.apache.org/jira/browse/PARQUET-1882
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Environment: Windows 10 64-bit, MSVC
> Reporter: Eric Gorelik
> Priority: Critical
>
> When a column containing only nulls is written unbuffered, a 0-byte
> dictionary page gets written. When the resulting file is then read with
> buffered_stream enabled, the column reader takes the length of the page
> (which is 0) and tries to read that many bytes from the underlying input
> stream.
> parquet/column_reader.cc, SerializedPageReader::NextPage
>
> {code:c++}
> int compressed_len = current_page_header_.compressed_page_size;
> int uncompressed_len = current_page_header_.uncompressed_page_size;
> // Read the compressed data page.
> std::shared_ptr<Buffer> page_buffer;
> PARQUET_THROW_NOT_OK(stream_->Read(compressed_len, &page_buffer));{code}
>
> BufferedInputStream::Read, however, asserts that the number of bytes to
> read is strictly positive, so the assertion fails and the process aborts.
> arrow/io/buffered.cc, BufferedInputStream::Impl
>
> {code:c++}
> Status Read(int64_t nbytes, int64_t* bytes_read, void* out) {
>   ARROW_CHECK_GT(nbytes, 0);
> {code}
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)