Andrei Iatsuk created PARQUET-1237:
--------------------------------------
Summary: Reading big texts causes OutOfMemoryError. How to read
text partially?
Key: PARQUET-1237
URL: https://issues.apache.org/jira/browse/PARQUET-1237
Project: Parquet
Issue Type: Bug
Components: parquet-avro
Affects Versions: 1.8.1
Environment: I have a dataset with big strings (every record is about
15 MB) stored in Parquet.
When I try to open all Parquet parts, I get an OutOfMemory exception.
How can I get only the header (the first 100 characters) of each string record
without reading the whole record?
{code:java}
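import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

// Projection schema: request only the "idx" and "text" columns.
Schema avroProj = SchemaBuilder.builder()
    .record("proj").fields()
    .name("idx").type().nullable().longType().noDefault()
    .name("text").type().nullable().bytesType().noDefault()
    .endRecord();
Configuration conf = new Configuration();
AvroReadSupport.setRequestedProjection(conf, avroProj);
// filePath points to one Parquet part file (defined elsewhere).
ParquetReader<GenericRecord> parquetReader = AvroParquetReader
    .<GenericRecord>builder(new Path(filePath))
    .withConf(conf)
    .build();
GenericRecord record = parquetReader.read();
// At this point the record already holds the full text in RAM.
Long idx = (Long) record.get("idx");
ByteBuffer rawText = (ByteBuffer) record.get("text");
// Copies the whole value into a String before cutting it;
// substring(0, 200) throws if the text is shorter than 200 characters.
String header = new String(rawText.array()).substring(0, 200);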
{code}
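For reference, a minimal sketch of how the header could be cut from the ByteBuffer without first turning the whole value into a String. It does not solve the question above: the full value is still loaded by the reader, and the sketch only avoids the additional full-size String copy. The class and method names (HeaderPrefix, headerOf) are hypothetical, and it assumes the text is UTF-8 encoded.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of parquet-avro.
final class HeaderPrefix {

    // Decodes at most maxChars UTF-16 chars from the buffer's remaining bytes.
    // Assumption: the bytes are UTF-8 encoded text.
    static String headerOf(ByteBuffer raw, int maxChars) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
        // The output buffer is bounded, so decoding stops once it is full.
        CharBuffer out = CharBuffer.allocate(maxChars);
        // duplicate() shares the bytes but leaves the caller's position untouched.
        decoder.decode(raw.duplicate(), out, true);
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        ByteBuffer raw = ByteBuffer.wrap(
            "hello world".getBytes(StandardCharsets.UTF_8));
        System.out.println(headerOf(raw, 5)); // prints "hello"
    }
}
```

Unlike rawText.array(), this respects the buffer's position and limit, and it returns the whole text when the value is shorter than the requested length instead of throwing.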
Reporter: Andrei Iatsuk
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)