I've been working with Trino, but I'm not that deep into it yet. I have noticed
that a bunch of Parquet functionality is duplicated in Trino. This may be
necessary to work around ParquetMR being able or unable to read the same file
depending on which API you use. Or it may just be that some functionality is
specific to the database and isn't appropriate to put into ParquetMR. Not sure.
Trino certainly depends on and uses a lot of ParquetMR, but there's loads of it
that it doesn't use.

On 4/26/22, 10:19 AM, "Xinyu Zeng" <[email protected]> wrote:


    Thanks Tim and Gamaken! They are helpful links and code.

    A follow-up (maybe stupid) question: it seems other Java-based engines
    like Presto have their own implementations of Parquet read/write. Is
    that because parquet-mr can only deserialize Parquet into some
    specific format like avro/thrift/protobuf, but some other engines need
    tight coupling between Parquet and their in memory format? They also
    need some different IO/buffering techniques than parquet-mr. If my
    understanding is correct, does that mean a unified parquet
    implementation does not exist and that is not the purpose of
    parquet-mr?

    Thanks

    On Tue, Apr 26, 2022 at 9:37 PM Miller, Tim <[email protected]> wrote:
    >
    > Also, using the API is a pain, because you have to use Hadoop. Various
    > people have found work-arounds for this, such as:
    > Comments on: https://issues.apache.org/jira/browse/PARQUET-1822
    >
    > I also assembled a minimal reader myself (from code I found elsewhere on
    > github, which I should add attributions for later) which I put here:
    > https://github.com/theosib-amazon/parquet-mr-minreader
    >
    > On 4/25/22, 2:51 PM, "gamaken k" <[email protected]> wrote:
    >
    >     > wiki on how to use the api
    >     +1 to this. I too think this would be very useful for getting started.
    >     Xinyu, you could potentially look at parquet-cli's source code to
    >     understand how it invokes the various APIs from parquet-mr, I think.
    >
    >     On Sun, Apr 24, 2022 at 8:29 AM Xinyu Zeng <[email protected]> wrote:
    >
    >     > Hi,
    >     >
    >     > I am a previous user of parquet-cpp (now integrated with Arrow) and now
    >     > I am going to use the Java version, parquet-mr. However, I did not find
    >     > any doc or wiki on how to use the API. I am also interested in
    >     > contributing but there is also no contribution guide like other open
    >     > source projects. I would appreciate it if someone could give me a
    >     > short guide.
    >     >
    >     > Thanks,
    >     > Xinyu
    >     >
    >
