Hi Nicolas,

Have you tried increasing the maximum Java heap size?
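For example, if the importer runs as a standalone JVM, you can pass a larger
-Xmx when launching it. A minimal illustration (the jar name and the 8g figure
are just placeholders, pick a value that fits the 16 GB box):

    java -Xmx8g -jar importer.jar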
https://stackoverflow.com/a/15517399/5613485

Br,
Zoltan

On Wed, Aug 29, 2018 at 8:39 PM Nicolas Troncoso <[email protected]> wrote:
>
> I'm clearly not understanding something: it's a 960 MB file. Even if it were
> fully loaded into memory, there should be more than enough memory to do the
> processing.
>
> I think the culprit of my problems is the size of the row group. I'm trying
> to get the people who generate these files to make the row-group size
> smaller. These files are only for transporting data from one place to
> another and are not intended for columnar manipulation, so there is not much
> gain in having massive row groups.
>
> On that subject: is it possible to read the row group in parts, to avoid
> having the whole thing in memory?
>
>
> On Sun, Aug 26, 2018 at 9:10 PM Nicolas Troncoso <[email protected]> wrote:
> >
> > Hi,
> > I'm loading a Parquet file of about 960 MB:
> >
> > maeve:$ parquet-tools rowcount -d part-00000-6e54bc7e-bb79-423d-ae6a-d46ea55b591b-c000.snappy.parquet
> > part-00000-6e54bc7e-bb79-423d-ae6a-d46ea55b591b-c000.snappy.parquet row count: 77318
> > Total RowCount: 77318
> >
> > maeve:$ parquet-tools size -d part-00000-6e54bc7e-bb79-423d-ae6a-d46ea55b591b-c000.snappy.parquet
> > part-00000-6e54bc7e-bb79-423d-ae6a-d46ea55b591b-c000.snappy.parquet: 1009297251 bytes
> > Total Size: 1009297251 bytes
> >
> > using the following Java code snippet:
> >
> > import java.io.IOException;
> > import java.util.ArrayList;
> > import java.util.List;
> >
> > import org.apache.parquet.column.page.PageReadStore;
> > import org.apache.parquet.example.data.Group;
> > import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
> > import org.apache.parquet.hadoop.ParquetFileReader;
> > import org.apache.parquet.io.ColumnIOFactory;
> > import org.apache.parquet.io.MessageColumnIO;
> > import org.apache.parquet.io.RecordReader;
> > import org.apache.parquet.schema.MessageType;
> >
> > public long importProcessParquetFile(ParquetFileReader parquetReader, LongRunningTaskTracker tracker)
> >     throws IOException {
> >   long records = 0;
> >   PageReadStore pages = null;
> >   List<D> batch = new ArrayList<>();
> >   MessageType schema = parquetReader.getFooter().getFileMetaData().getSchema();
> >   logger.warn("Got Schema");
> >   while (null != (pages = parquetReader.readNextRowGroup())) {
> >     MessageColumnIO columnIo = new ColumnIOFactory().getColumnIO(schema);
> >     logger.warn("Got columnIo");
> >     RecordReader<Group> recordReader = columnIo.getRecordReader(pages, new GroupRecordConverter(schema));
> >     // ^^^^^^ this line causes the OOM kill on the production environment.
> >     logger.warn("Got recordReader");
> >     for (int i = 0; i < pages.getRowCount(); i++) {
> >       D bean = parseBean(recordReader.read());
> >       if (bean == null) {
> >         logger.warn("Could not create bean while importing Zillow Region Master", new Throwable());
> >         continue;
> >       }
> >       batch.add(bean);
> >       if (batch.size() >= getBatchSize()) {
> >         records += storeAndClearBatch(batch);
> >         tracker.heartbeat(records);
> >       }
> >     }
> >   }
> >   records += storeAndClearBatch(batch);
> >   tracker.heartbeat(records);
> >   return records;
> > }
> >
> > The production environment has 16 GB of RAM and no swap.
> >
> > I'm clearly not understanding something: it's a 960 MB file. Even if it
> > were fully loaded into memory, there should be more than enough memory to
> > do the processing.
> >
> > If I run it on my dev machine with a swap file, I can run it to completion.
> > I'm trying to understand why the memory footprint gets so big, and whether
> > there is a more efficient way to read the file.
> > Maybe there is a more efficient way to create the file?
> >
> > The file was created with parquet-mr 1.8; it is being read with
> > parquet-hadoop 1.9.
> >
> > cheers.
> >
>
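On the row-group question in the quoted mail: if the producers write with
parquet-mr directly, a smaller row-group size can be requested when the file is
written. A rough sketch using the example writer API, not necessarily how their
pipeline is set up; the method name, path, schema, data and the 64 MB figure
are only illustrative:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;

// Sketch: write Group records with ~64 MB row groups instead of the 128 MB default.
// outputPath, schema and rows stand in for whatever the producing job already has.
public void writeWithSmallerRowGroups(Path outputPath, MessageType schema, List<Group> rows)
    throws IOException {
  try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(outputPath)
      .withType(schema)
      .withCompressionCodec(CompressionCodecName.SNAPPY)
      .withRowGroupSize(64 * 1024 * 1024)  // the row group is also the unit your reader materializes
      .build()) {
    for (Group row : rows) {
      writer.write(row);
    }
  }
}

Since your reader pulls one row group at a time (readNextRowGroup() plus
getRecordReader()), smaller row groups should directly shrink the peak memory
each pass needs.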
