Andrei Iatsuk created PARQUET-1237:
--------------------------------------

             Summary: Reading big texts causes OutOfMemory error. How to read 
text partially?
                 Key: PARQUET-1237
                 URL: https://issues.apache.org/jira/browse/PARQUET-1237
             Project: Parquet
          Issue Type: Bug
          Components: parquet-avro
    Affects Versions: 1.8.1
         Environment: I have a dataset with big strings (every record is about 15 
MB) in Parquet.

When I try to open all Parquet parts I get an OutOfMemory exception.

How can I get only the headers (first 100 symbols) of each string record without 
reading the whole record?

 
{code:java}
  Schema avroProj = SchemaBuilder.builder()
    .record("proj").fields()
    .name("idx").type().nullable().longType().noDefault()
    .name("text").type().nullable().bytesType().noDefault()
    .endRecord();

  Configuration conf = new Configuration();

  AvroReadSupport.setRequestedProjection(conf, avroProj);
  ParquetReader<GenericRecord> parquetReader = AvroParquetReader
    .<GenericRecord>builder(new Path(filePath))
    .withConf(conf)
    .build();

  GenericRecord record = parquetReader.read();
  // the record already has the full text in RAM
  Long idx = (Long) record.get("idx");
  ByteBuffer rawText = (ByteBuffer) record.get("text");
  String header = new String(rawText.array()).substring(0, 200);
{code}
            Reporter: Andrei Iatsuk
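
A side note on the last line of the snippet above: `new String(rawText.array())` decodes the entire backing array with the platform charset, so it creates a second full-size copy of the text before `substring` runs, and it ignores the buffer's position and limit. The sketch below decodes only the first N characters instead. It does not stop the Parquet reader from materializing the full value, so it only reduces the extra allocation after `read()`. The class name `HeaderPreview` and the method `header` are hypothetical, and UTF-8 encoding of the text is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of parquet-avro.
public final class HeaderPreview {
    private HeaderPreview() {}

    /** Decodes at most maxChars UTF-16 chars from the readable part of buf (assumed UTF-8). */
    public static String header(ByteBuffer buf, int maxChars) {
        // duplicate() shares the bytes but has its own position and limit,
        // so the caller's buffer is left untouched.
        ByteBuffer view = buf.duplicate();
        // The output buffer caps how much is decoded: only maxChars chars are allocated.
        CharBuffer out = CharBuffer.allocate(maxChars);
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
        // decode() returns once the output is full or the input is exhausted,
        // whichever comes first; the rest of the input is never decoded.
        decoder.decode(view, out, true);
        out.flip();
        return out.toString();
    }
}
```

With this helper, the last line of the snippet could become `String header = HeaderPreview.header(rawText, 100);`, which also avoids the `StringIndexOutOfBoundsException` that `substring(0, 200)` throws for texts shorter than 200 characters.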




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
