Hi Nicolas,

Have you tried increasing the maximum Java heap size?

https://stackoverflow.com/a/15517399/5613485
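
For example, the maximum heap can be raised with the -Xmx JVM option (the
jar name and the 12g value below are just placeholders, pick whatever fits
your app and the 16GB box while leaving headroom for the OS):

    java -Xmx12g -jar your-importer.jar

If the app is launched through a wrapper script, the same option usually
goes into something like JAVA_OPTS.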

Br,

Zoltan

On Wed, Aug 29, 2018 at 8:39 PM Nicolas Troncoso <[email protected]> wrote:

> > I'm clearly not understanding something: it's a 960MB file. Even if it
> > got fully loaded into memory, there should be more than enough memory to
> > do the processing.
>
> I think the culprit of my problems is the size of the row groups. I'm
> trying to get the people who generate these files to make the row group
> size smaller. These files are only for transporting data from one place to
> another and are not intended for columnar manipulation, so there is not
> much gain in having massive row groups.
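>
> If they write with parquet-mr directly, the row group size can be set on
> the writer builder. A rough sketch - the output path, the schema variable
> and the 16MB target below are placeholders, and the default is 128MB:
>
> import org.apache.hadoop.fs.Path;
> import org.apache.parquet.example.data.Group;
> import org.apache.parquet.hadoop.ParquetWriter;
> import org.apache.parquet.hadoop.example.ExampleParquetWriter;
> import org.apache.parquet.hadoop.metadata.CompressionCodecName;
>
> // Sketch: target ~16MB row groups instead of the 128MB default.
> ParquetWriter<Group> writer = ExampleParquetWriter.builder(new Path("out.snappy.parquet"))
>     .withType(schema)                        // MessageType of the data
>     .withCompressionCodec(CompressionCodecName.SNAPPY)
>     .withRowGroupSize(16 * 1024 * 1024)      // row group size in bytes
>     .build();
> // ...then writer.write(group) per record and writer.close().
>
> If they produce the files with Spark or another Hadoop-based writer, I
> believe the equivalent knob is the parquet.block.size setting.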
>
> On that subject: is it possible to read a row group in parts, to avoid
> having the whole thing in memory?
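>
> For what it's worth, parquet-hadoop also has a record-at-a-time
> ParquetReader that hides the row-group handling. A rough sketch - the path
> is a placeholder - though I'm not sure it lowers peak memory, since it
> presumably still reads whole row groups internally:
>
> import org.apache.hadoop.fs.Path;
> import org.apache.parquet.example.data.Group;
> import org.apache.parquet.hadoop.ParquetReader;
> import org.apache.parquet.hadoop.example.GroupReadSupport;
>
> // Reads one record at a time; row-group iteration happens inside the reader.
> try (ParquetReader<Group> reader = ParquetReader.builder(
>         new GroupReadSupport(), new Path("part-00000.snappy.parquet")).build()) {
>     Group group;
>     while ((group = reader.read()) != null) {
>         // map the Group to a bean and batch it here
>     }
> }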
>
>
> On Sun, Aug 26, 2018 at 9:10 PM Nicolas Troncoso <[email protected]>
> wrote:
>
> > Hi,
> > I'm loading a ~960MB Parquet file:
> >
> > maeve:$ parquet-tools rowcount -d
> > part-00000-6e54bc7e-bb79-423d-ae6a-d46ea55b591b-c000.snappy.parquet
> > part-00000-6e54bc7e-bb79-423d-ae6a-d46ea55b591b-c000.snappy.parquet row
> > count: 77318
> > Total RowCount: 77318
> >
> > maeve:$ parquet-tools size -d
> > part-00000-6e54bc7e-bb79-423d-ae6a-d46ea55b591b-c000.snappy.parquet
> > part-00000-6e54bc7e-bb79-423d-ae6a-d46ea55b591b-c000.snappy.parquet:
> > 1009297251 bytes
> > Total Size: 1009297251 bytes
> >
> > with the following Java code snippet:
> >
> > import java.io.IOException;
> > import java.util.ArrayList;
> > import java.util.List;
> >
> > import org.apache.parquet.column.page.PageReadStore;
> > import org.apache.parquet.example.data.Group;
> > import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
> > import org.apache.parquet.hadoop.ParquetFileReader;
> > import org.apache.parquet.io.ColumnIOFactory;
> > import org.apache.parquet.io.MessageColumnIO;
> > import org.apache.parquet.io.RecordReader;
> > import org.apache.parquet.schema.MessageType;
> >
> > public long importProcessParquetFile(ParquetFileReader parquetReader,
> >                                      LongRunningTaskTracker tracker)
> >         throws IOException {
> >     long records = 0;
> >     PageReadStore pages = null;
> >     List<D> batch = new ArrayList<>();
> >     MessageType schema =
> >             parquetReader.getFooter().getFileMetaData().getSchema();
> >     logger.warn("Got Schema");
> >     while (null != (pages = parquetReader.readNextRowGroup())) {
> >         MessageColumnIO columnIo = new ColumnIOFactory().getColumnIO(schema);
> >         logger.warn("Got columnIo");
> >         RecordReader<Group> recordReader = columnIo.getRecordReader(pages,
> >                 new GroupRecordConverter(schema));
> >         // ^^^^^^ this line causes the OOM kill on the production environment.
> >         logger.warn("Got recordReader");
> >         for (int i = 0; i < pages.getRowCount(); i++) {
> >             D bean = parseBean(recordReader.read());
> >             if (bean == null) {
> >                 logger.warn("Could not create bean while importing Zillow Region Master",
> >                         new Throwable());
> >                 continue;
> >             }
> >             batch.add(bean);
> >             if (batch.size() >= getBatchSize()) {
> >                 records += storeAndClearBatch(batch);
> >                 tracker.heartbeat(records);
> >             }
> >         }
> >     }
> >     records += storeAndClearBatch(batch);
> >     tracker.heartbeat(records);
> >     return records;
> > }
> >
> > The production environment has 16GB of RAM and no swap.
> >
> > I'm clearly not understanding something: it's a 960MB file. Even if it
> > got fully loaded into memory, there should be more than enough memory to
> > do the processing.
> >
> > If I run it on my dev machine, which has a swap file, it runs to
> > completion.
> > I'm trying to understand why the memory footprint gets so big, and
> > whether there is a more efficient way to read the file. Maybe there is a
> > more efficient way to create the file?
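> >
> > To see how the data is laid out in row groups, which is probably where
> > the memory goes, the same parquet-tools used above has a meta command
> > that prints per-row-group row counts and sizes:
> >
> > parquet-tools meta part-00000-6e54bc7e-bb79-423d-ae6a-d46ea55b591b-c000.snappy.parquet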
> >
> > The file was created with parquet-mr 1.8 and is being read with
> > parquet-hadoop 1.9.
> >
> > cheers.
> >
> >
>
