> I'm clearly not understanding something: it's a 960MB file. Even if it got
> fully loaded into memory, there should be more than enough to do the
> processing.

I think the culprit of my problems is the size of the row groups. I'm trying
to get the people who generate these files to make the row group size
smaller. These files are only for transporting data from one place to
another and are not intended for columnar manipulation, so there is not much
gain in having massive row groups.
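
For what it's worth, this is roughly the knob I'm asking them to turn. A minimal
sketch, assuming they write with parquet-mr's Group-based ExampleParquetWriter
(the output path, schema variable, and the 64MB figure are only illustrative):

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.example.ExampleParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;
    import org.apache.parquet.schema.MessageType;

    public ParquetWriter<Group> openWriter(Path out, MessageType schema) throws IOException {
        return ExampleParquetWriter.builder(out)
                .withType(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                // smaller row groups bound how much a reader has to hold at once
                .withRowGroupSize(64 * 1024 * 1024) // e.g. 64MB instead of the 128MB default
                .build();
    }

If they write the files from Spark, I believe the equivalent setting is the
parquet.block.size Hadoop property.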

On that subject: is it possible to read a row group in parts, to avoid
having the whole thing in memory?
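
And in case it makes a difference: would the higher-level record API be expected
to behave any differently memory-wise than my row-group loop below? A rough
sketch of what I mean, using the Group-based read support from parquet-hadoop
(the file name is just a placeholder):

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.example.GroupReadSupport;

    public void streamRecords(Path file) throws IOException {
        try (ParquetReader<Group> reader =
                ParquetReader.builder(new GroupReadSupport(), file).build()) {
            Group record;
            while ((record = reader.read()) != null) {
                // process one record at a time
            }
        }
    }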


On Sun, Aug 26, 2018 at 9:10 PM Nicolas Troncoso <[email protected]> wrote:

> Hi,
> I'm loading a roughly 960MB Parquet file:
>
> maeve:$ parquet-tools rowcount -d
> part-00000-6e54bc7e-bb79-423d-ae6a-d46ea55b591b-c000.snappy.parquet
> part-00000-6e54bc7e-bb79-423d-ae6a-d46ea55b591b-c000.snappy.parquet row
> count: 77318
> Total RowCount: 77318
>
> maeve:$ parquet-tools size -d
> part-00000-6e54bc7e-bb79-423d-ae6a-d46ea55b591b-c000.snappy.parquet
> part-00000-6e54bc7e-bb79-423d-ae6a-d46ea55b591b-c000.snappy.parquet:
> 1009297251 bytes
> Total Size: 1009297251 bytes
>
> with the following Java code snippet:
>
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
>
> import org.apache.parquet.column.page.PageReadStore;
> import org.apache.parquet.example.data.Group;
> import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
> import org.apache.parquet.hadoop.ParquetFileReader;
> import org.apache.parquet.io.ColumnIOFactory;
> import org.apache.parquet.io.MessageColumnIO;
> import org.apache.parquet.io.RecordReader;
> import org.apache.parquet.schema.MessageType;
>
> public long importProcessParquetFile(ParquetFileReader parquetReader,
>         LongRunningTaskTracker tracker) throws IOException {
>     long records = 0;
>     PageReadStore pages = null;
>     List<D> batch = new ArrayList<>();
>     MessageType schema = parquetReader.getFooter().getFileMetaData().getSchema();
>     logger.warn("Got Schema");
>     while (null != (pages = parquetReader.readNextRowGroup())) {
>         MessageColumnIO columnIo = new ColumnIOFactory().getColumnIO(schema);
>         logger.warn("Got columnIo");
>         RecordReader<Group> recordReader =
>                 columnIo.getRecordReader(pages, new GroupRecordConverter(schema));
>         // ^^^^^^ this line causes the OOM kill on the production environment.
>         logger.warn("Got recordReader");
>         for (int i = 0; i < pages.getRowCount(); i++) {
>             D bean = parseBean(recordReader.read());
>             if (bean == null) {
>                 logger.warn("Could not create bean while importing Zillow Region Master",
>                         new Throwable());
>                 continue;
>             }
>             batch.add(bean);
>             if (batch.size() >= getBatchSize()) {
>                 records += storeAndClearBatch(batch);
>                 tracker.heartbeat(records);
>             }
>         }
>     }
>     records += storeAndClearBatch(batch);
>     tracker.heartbeat(records);
>     return records;
> }
>
> The production environment has 16GB RAM and no swap.
>
> I'm clearly not understanding something: it's a 960MB file. Even if it got
> fully loaded into memory, there should be more than enough to do the
> processing.
>
> If I run it on my dev machine with a swap file, I can run it to completion.
> I'm trying to understand why the memory footprint gets so big, and whether
> there is a more efficient way to read the file.
> Maybe there is a more efficient way to create the file?
>
> The file was created with parquet-mr 1.8 and is being read with
> parquet-hadoop 1.9.
>
> cheers.
>
>
