Hi Joris,

The Apache Parquet mailing list is d...@parquet.apache.org. I'm copying
the list here.

AFAIK parquet-mr doesn't feature vectorized reading (for Arrow or
otherwise). There are some vectorized Java-based readers in the wild: in
Dremio [1] and Apache Spark, at least. I'd be interested to see a
reusable library that supports vectorized Arrow reads in Java.

- Wes

[1]: https://github.com/dremio/dremio-oss

On Thu, May 23, 2019 at 8:54 AM Joris Peeters <joris.mg.peet...@gmail.com> wrote:
>
> Hello,
>
> I'm trying to read a Parquet file from disk into Arrow in memory, in Scala.
> I'm wondering what the most efficient approach is, especially for the
> reading part. I'm aware that Parquet reading is perhaps beyond the scope
> of this mailing list, but:
>
> - I believe Arrow and Parquet are closely intertwined these days?
> - I can't find an appropriate Parquet mailing list.
>
> Any pointers would be appreciated!
>
> Below is the code I currently have. My concern is that this alone already
> takes about 2s, whereas "pq.read_pandas(the_file_path).to_pandas()" takes
> ~100ms in Python. So I suspect I'm not doing this in the most efficient
> way possible... The Parquet data holds 1,570,150 rows, with 14 columns of
> various types, and takes 15MB on disk.
>
> import org.apache.hadoop.conf.Configuration
> import org.apache.parquet.column.ColumnDescriptor
> import org.apache.parquet.example.data.simple.convert.GroupRecordConverter
> import org.apache.parquet.format.converter.ParquetMetadataConverter
> import org.apache.parquet.hadoop.ParquetFileReader
> import org.apache.parquet.io.ColumnIOFactory
>
> ...
>
> val path: Path = Paths.get("C:\\item.pq")
> val jpath = new org.apache.hadoop.fs.Path(path.toFile.getAbsolutePath)
> val conf = new Configuration()
>
> val readFooter = ParquetFileReader.readFooter(conf, jpath,
>   ParquetMetadataConverter.NO_FILTER)
> val schema = readFooter.getFileMetaData.getSchema
> val r = ParquetFileReader.open(conf, jpath)
>
> val pages = r.readNextRowGroup()
> val rows = pages.getRowCount
>
> val columnIO = new ColumnIOFactory().getColumnIO(schema)
> val recordReader = columnIO.getRecordReader(pages,
>   new GroupRecordConverter(schema))
>
> // This takes about 2s
> (1 to rows.toInt).map { i =>
>   val group = recordReader.read
>   // Just read the first column for now ...
>   val x = group.getLong(0, 0)
> }
>
> ...
>
> As this will be in the hot path of my code, I'm quite keen to make it
> as fast as possible. Note that the eventual objective is to build
> Arrow data. I was assuming there would be a way to quickly load the
> columns. I suspect the loop over the rows, building row-based records,
> is causing a lot of overhead, but can't seem to find another way.
>
> Thanks,
>
> -J
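
For reference, one way to cut that per-row overhead is to skip record
assembly entirely and walk each column chunk with parquet-mr's
ColumnReader, copying values straight into an Arrow vector. The sketch
below is untested and makes a few assumptions: parquet-mr 1.10's
four-argument ColumnReadStoreImpl constructor, the arrow-vector library
on the classpath, and a first column that is a flat (non-repeated)
INT64; the vector name "col0" is arbitrary.

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.BigIntVector
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.column.impl.ColumnReadStoreImpl
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader

val jpath = new org.apache.hadoop.fs.Path("C:\\item.pq")
val conf = new Configuration()

val footer = ParquetFileReader.readFooter(conf, jpath,
  ParquetMetadataConverter.NO_FILTER)
val schema = footer.getFileMetaData.getSchema
val createdBy = footer.getFileMetaData.getCreatedBy
val reader = ParquetFileReader.open(conf, jpath)

val allocator = new RootAllocator(Long.MaxValue)
val vector = new BigIntVector("col0", allocator) // assumes column 0 is INT64
vector.allocateNew()

var row = 0
var pages = reader.readNextRowGroup()
while (pages != null) {
  // Read the column chunk directly instead of assembling Group records.
  val store = new ColumnReadStoreImpl(
    pages, new GroupRecordConverter(schema).getRootConverter, schema, createdBy)
  val desc = schema.getColumns.get(0)
  val col = store.getColumnReader(desc)
  val maxDef = desc.getMaxDefinitionLevel
  val total = col.getTotalValueCount
  var i = 0L
  while (i < total) {
    // For a flat column, a definition level below the max marks a null.
    if (col.getCurrentDefinitionLevel == maxDef) vector.setSafe(row, col.getLong)
    else vector.setNull(row)
    col.consume()
    row += 1
    i += 1
  }
  pages = reader.readNextRowGroup()
}
vector.setValueCount(row)
reader.close()

The same loop generalizes to other column types (Float8Vector for
DOUBLE, VarCharVector for BYTE_ARRAY strings, and so on), and each
column can be filled independently, which is where the columnar path
wins over row-by-row record assembly.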