Andrei Iatsuk created PARQUET-1237:
--------------------------------------

             Summary: Reading big texts causes OutOfMemory error. How to read text partially?
                 Key: PARQUET-1237
                 URL: https://issues.apache.org/jira/browse/PARQUET-1237
             Project: Parquet
          Issue Type: Bug
          Components: parquet-avro
    Affects Versions: 1.8.1
         Environment: I have a dataset with big strings (each record is about 15 MB) in Parquet.
When I try to open all Parquet parts, I get an OutOfMemory exception. How can I get only the header (the first 100 characters) of each string record without reading the whole record?

{code:java}
import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

// Projection schema: request only the "idx" and "text" columns.
Schema avroProj = SchemaBuilder.builder()
    .record("proj").fields()
    .name("idx").type().nullable().longType().noDefault()
    .name("text").type().nullable().bytesType().noDefault()
    .endRecord();

Configuration conf = new Configuration();
AvroReadSupport.setRequestedProjection(conf, avroProj);

ParquetReader<GenericRecord> parquetReader = AvroParquetReader
    .<GenericRecord>builder(new Path(filePath))
    .withConf(conf)
    .build();

GenericRecord record = parquetReader.read(); // record already has the full text in RAM
Long idx = (Long) record.get("idx");
ByteBuffer rawText = (ByteBuffer) record.get("text");
String header = new String(rawText.array()).substring(0, 200);
{code}

            Reporter: Andrei Iatsuk
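
A side note on the last line of the snippet: `new String(rawText.array())` decodes the entire value into a second in-memory copy before `substring` runs, and `array()` returns the whole backing array regardless of the buffer's position and limit. The following is only a minimal sketch of decoding just the leading characters, assuming the text is UTF-8 encoded; `HeaderDecoder` and `decodeHeader` are hypothetical names, not part of parquet-avro. It does not stop the reader from materializing the full value, which is the actual question in this issue.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not a parquet-avro API): decodes only the start of a buffer.
public class HeaderDecoder {

    /**
     * Decodes at most maxChars chars from the start of a buffer.
     * Assumption: the bytes are UTF-8 encoded text.
     */
    public static String decodeHeader(ByteBuffer raw, int maxChars) {
        // duplicate() shares the bytes but has its own position and limit,
        // so the caller's buffer is left untouched.
        ByteBuffer view = raw.duplicate();
        CharBuffer out = CharBuffer.allocate(maxChars);
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        // Decoding stops as soon as 'out' is full or the input is exhausted,
        // so only the leading bytes of a large value are decoded.
        decoder.decode(view, out, true);
        out.flip();
        return out.toString();
    }
}
```

With this helper, the last line of the snippet could become `String header = HeaderDecoder.decodeHeader(rawText, 100);`. The result may be shorter than `maxChars` when the input is short or when a surrogate pair would not fit in the remaining space.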