[
https://issues.apache.org/jira/browse/PARQUET-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15555798#comment-15555798
]
Florian Scheibner commented on PARQUET-739:
-------------------------------------------
Here's an adapted version of parquet-reader that reproduces the problem. The
input files need to contain RLE dictionary-encoded text columns. I'll try to
create some test files.
I've looked through the code, and it seems that unpack32 writes to a stack
buffer created by GetBatchWithDict(). I don't see how that could collide
across multiple threads.
{code}
#include <iostream>
#include <fstream>
#include <list>
#include <memory>
#include <thread>
#include "parquet/api/reader.h"
#include "parquet/column/scanner.h"
#include "parquet/file/reader.h"
#include <cstdio>
#include <sstream>
#include <string>
#include <utility>
#include <vector>
int main(int argc, char** argv)
{
  if (argc > 5 || argc < 2)
  {
    std::cerr << "Usage: parquet_reader <file...>" << std::endl;
    return -1;
  }
  std::string filename;
  std::vector<std::thread> readers;
  // Read command-line options
  for (int i = 1; i < argc; i++)
  {
    filename = argv[i];
    // One reader thread per input file
    readers.emplace_back(
      [filename]()
      {
        std::ofstream stream("/dev/null");
        parquet::TrackingAllocator allocator;
        parquet::ReaderProperties props(&allocator);
        bool memory_map = true;
        std::unique_ptr<parquet::ParquetFileReader> reader =
          parquet::ParquetFileReader::OpenFile(filename, memory_map, props);
        const parquet::FileMetaData* file_metadata = reader->metadata();
        for (int r = 0; r < file_metadata->num_row_groups(); ++r)
        {
          auto group_reader = reader->RowGroup(r);
          // Create readers for selected columns and print contents
          std::vector<std::shared_ptr<parquet::Scanner>> scanners;
          for (int j = 0; j < file_metadata->schema()->num_columns(); ++j)
          {
            std::shared_ptr<parquet::ColumnReader> col_reader =
              group_reader->Column(j);
            // This is OK in this method as long as the RowGroupReader does
            // not get deleted
            scanners.push_back(
              parquet::Scanner::Make(col_reader,
                parquet::DEFAULT_SCANNER_BATCH_SIZE, &allocator));
          }
          bool hasRow;
          do
          {
            hasRow = false;
            for (auto scanner : scanners)
            {
              if (scanner->HasNext())
              {
                hasRow = true;
                scanner->PrintNext(stream, 100);
              }
            }
          } while (hasRow);
        }
      });
  }
  for (auto& t : readers)
  {
    t.join();
  }
  return 0;
}
{code}
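To make the suspected failure class concrete, here is a minimal sketch of a
non-owning value (a pointer plus a length, similar in shape to a ByteArray)
that stays valid only because it carries a shared owner of the buffer it
points into. This is not parquet-cpp code and not a proposed fix: ByteView,
PageBuffer, PinnedView, MakePinnedView and ToString are hypothetical names
invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical non-owning view: a pointer and a length, no ownership.
struct ByteView {
  const uint8_t* ptr;
  uint32_t len;
};

// Hypothetical stand-in for a buffer that owns decoded page bytes.
using PageBuffer = std::vector<uint8_t>;

// A view bundled with a shared owner of the bytes it refers to.
// The bytes cannot be freed while any PinnedView referring to them exists.
struct PinnedView {
  std::shared_ptr<const PageBuffer> owner;
  ByteView view;
};

// Builds a view over page[offset, offset + len) that keeps the page alive.
PinnedView MakePinnedView(std::shared_ptr<const PageBuffer> page,
                          std::size_t offset, uint32_t len) {
  assert(offset + len <= page->size());
  // Compute the view before moving the owner into the result.
  ByteView v{page->data() + offset, len};
  return PinnedView{std::move(page), v};
}

// Copies the viewed bytes into a string.
std::string ToString(const PinnedView& p) {
  return std::string(reinterpret_cast<const char*>(p.view.ptr), p.view.len);
}
```

With a bare ByteView, dropping the last reference to the buffer would leave
the pointer dangling, which is the kind of access AddressSanitizer reports as
heap-use-after-free. With PinnedView, the caller can release its own handle
to the buffer and still read the view safely.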
> Read after free with uncompressed page
> --------------------------------------
>
> Key: PARQUET-739
> URL: https://issues.apache.org/jira/browse/PARQUET-739
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Reporter: Florian Scheibner
> Assignee: Florian Scheibner
>
> Reading two parquet files in parallel led to memory corruption that caused
> a crash. The columns are RLE dictionary-encoded strings in an uncompressed
> page, created with parquet-mr. AddressSanitizer (-fsanitize=address) tracked
> the issue to a use-after-free:
> {code}
> =================================================================
> ==81678==ERROR: AddressSanitizer: heap-use-after-free on address
> 0x6060001088c0 at pc 0x000003dbd42b bp 0x7fffe30fbe00 sp 0x7fffe30fbdf8
> READ of size 16 at 0x6060001088c0 thread T8
> #0 0x3dbd42a in int
> parquet::RleDecoder::GetBatchWithDict<parquet::ByteArray>(parquet::Vector<parquet::ByteArray>
> const&, parquet::ByteArray*, int)
> (/home/fscheibner/Snowflake/ExecPlatform/bin/snowflake+0x3dbd42a)
> #1 0x3db8efa in
> parquet::DictionaryDecoder<parquet::DataType<(parquet::Type::type)6>
> >::Decode(parquet::ByteArray*, int)
> (/home/fscheibner/Snowflake/ExecPlatform/bin/snowflake+0x3db8efa)
> #2 0x3d84767 in
> parquet::TypedColumnReader<parquet::DataType<(parquet::Type::type)6>
> >::ReadValues(long, parquet::ByteArray*)
> (/home/fscheibner/Snowflake/ExecPlatform/bin/snowflake+0x3d84767)
> #3 0x3d83497 in
> parquet::TypedColumnReader<parquet::DataType<(parquet::Type::type)6>
> >::ReadBatch(int, short*, short*, parquet::ByteArray*, long*)
> (/home/fscheibner/Snowflake/ExecPlatform/bin/snowflake+0x3d83497)
> {code}
> Initial debugging showed that the dictionary indices returned by the RLE
> decoder are garbage, so the data page apparently got corrupted in memory.
> Reading the files in one thread works.
> I have a ColumnReader for each column and read one element from each column
> to get a complete row.
> My guess is that some data buffer is freed and then later still used for
> reading. I couldn't track down the source yet. Any ideas [~wesmckinn]?