I've been working with Trino, but I'm not that deep into it yet. I have noticed that a bunch of Parquet functionality is duplicated in Trino. This may be necessary to work around one of the problems with ParquetMR: whether it can read a given file sometimes depends on which API you use. Or it may just be that some functionality is specific to the database and isn't appropriate to put into ParquetMR. Not sure. What I can say is that Trino depends on and uses a lot of ParquetMR, but there's also a lot of it that Trino doesn't use.

On 4/26/22, 10:19 AM, "Xinyu Zeng" <[email protected]> wrote:

Thanks Tim and Gamaken! They are helpful links and code.

A follow-up (maybe stupid) question: it seems other Java-based engines
like Presto have their own implementations of Parquet read/write. Is
that because parquet-mr can only deserialize Parquet into some specific
format like Avro/Thrift/Protobuf, but some other engines need tight
coupling between Parquet and their in-memory format? They also need
some different IO/buffering techniques than parquet-mr. If my
understanding is correct, does that mean a unified Parquet
implementation does not exist and that is not the purpose of
parquet-mr?

Thanks

On Tue, Apr 26, 2022 at 9:37 PM Miller, Tim <[email protected]> wrote:
>
> Also, using the API is a pain, because you have to use Hadoop. Various people have found work-arounds for this, such as:
> Comments on: https://issues.apache.org/jira/browse/PARQUET-1822
>
> I also assembled a minimal reader myself (from code I found elsewhere on GitHub, which I should add attributions for later) which I put here:
> https://github.com/theosib-amazon/parquet-mr-minreader
>
> On 4/25/22, 2:51 PM, "gamaken k" <[email protected]> wrote:
>
> > wiki on how to use the api
> +1 to this. I too think this would be very useful for getting started.
> Xinyu, you could potentially look at parquet-cli's source code to
> understand how it invokes the various APIs from parquet-mr, I think.
>
> On Sun, Apr 24, 2022 at 8:29 AM Xinyu Zeng <[email protected]> wrote:
>
> > Hi,
> >
> > I am a previous user of parquet-cpp (now integrated with Arrow) and now
> > I am going to use the Java version, parquet-mr. However, I did not find
> > any doc or wiki on how to use the API. I am also interested in
> > contributing, but there is also no contribution guide like other open
> > source projects have. I would appreciate it if someone could give me a
> > short guide.
> >
> > Thanks,
> > Xinyu
