Re: parquet-mr exposing public API

Gabor Szadovszky Tue, 15 May 2018 04:30:08 -0700

Hi Everyone,

I’ve made a change on top of my development related to the column indexes: 
https://github.com/apache/parquet-mr/pull/456/commits/77123ff69cdd78f449f1b2996f37618a9c66e256
 
<https://github.com/apache/parquet-mr/pull/456/commits/77123ff69cdd78f449f1b2996f37618a9c66e256>
Could you please check and comment about my proposal?


Cheers,
Gabor

> On 11 May 2018, at 07:12, Gabor Szadovszky <[email protected]> 
> wrote:
> 
> Hi,
> 
> I completely agree on using private and package-protected visibility wherever 
> it is possible. Let me explain an actual example where it is not possible.
> For the column index implementation I’ve created interfaces and builders for 
> ColumnIndex and OffsetIndex. I’ve put them to the module parquet-column. 
> These classes have to be visible for the metadata converter in parquet-hadoop 
> and e.g. for the CLI to make it visible for debugging purposes. They also 
> have to be used by ParquetFileWriter and Reader as well for filtering. So, 
> they have to be visible in many different places of the code base. However, I 
> would not expose them to the user as they work automatically (just like the 
> classes around statistics).
> 
> Some words about the Yetus InterfaceAudience annotations. I don’t have any 
> experience on using these. After a quick search it seems we can skip the 
> javadoc generation for “private” classes. I don’t know if other interesting 
> tooling is available but it certainly has some benefits comparing to simple 
> comments.
> To summarize the concept of the annotations:
> Public: public to the user; As we already exposed everything, I think, this 
> is the default. However, other projects state that no annotation means 
> private so I am not sure.
> LimitiedPrivate: only public to specified projects (e.g. Hadoop); I don’t 
> think we need it
> Private: for internal use only; If we would like to categorize existing code 
> this one is the most useful for us.
> In addition, these annotations can be also used on methods not only on 
> classes, so we can mark “private” on the current API without any refactor.
> 
> My proposal is to use the internal packaging for new implementations (only in 
> cases where it is really necessary). It would also help restructuring the API 
> for 2.0. In addition, we can use the Yetus InterfaceAudience annotations for 
> the existing API as a separate effort. It would help to see what parts of the 
> API should and should not be used by the user. (Currently, it makes really 
> hard to extend/change the API in feature developments.) If we start using 
> both ways it would be a good idea to also annotate the *.internal*.classes 
> with @Private for consistency.
> 
> If we agree on using the internal packaging it would be great to also agree 
> on the pattern. My proposal is to use org.apache.parquet.internal.** 
> (org.apache.parquet.internal.column.columnindex for my previous example).
> 
> What do you guys think?
> 
> Cheers,
> Gabor
> 
> 
>> On 10 May 2018, at 21:39, Ryan Blue <[email protected]> wrote:
>> 
>> I think we should first try to use private and package-protected visibility
>> in Java first, and only use annotations or internal packages if we need to
>> expose internals across packages for some reason.
>> 
>> For the yetus annotations, what is the benefit of using these? Is there
>> some integration with javadoc tools? What are the annotations that you're
>> proposing to use and what do they mean?
>> 
>> rb
>> 
>> On Thu, May 10, 2018 at 6:44 AM, Gabor Szadovszky <
>> [email protected]> wrote:
>> 
>>> Hi Everyone,
>>> 
>>> Unfortunately, I was not able to participate on the last Parquet sync
>>> where you discussed how to separate the java public code parts that are
>>> public for internal use only or public to be exposed to the user. I am
>>> currently adding some new code related to the column indexes and would like
>>> to ask about your opinions.
>>> Use the Yetus InterfaceAudience annotations to mark the public/private
>>> classes
>>> pros:
>>> backward compatible (if we want to categorise the already existing API, it
>>> can be done by using this)
>>> already used by some of the Hadoop components
>>> cons:
>>> not sure if it is much better than a simple comment; don’t know any
>>> tooling that checks this
>>> even if we mark some existing API private we cannot remove/modify them
>>> otherwise we break the compatibility
>>> Use internal package naming convention for the new classes
>>> pros:
>>> obvious to the user
>>> future proof (if we would like to use the java9 module feature in
>>> parquet-mr 2.x)
>>> cons:
>>> backward incompatible (we are not able to categorise the already existing
>>> API)
>>> can be ignored; I don’t know any way for java8 to enforce it
>>> 
>>> If we would like to mark the internal API for the existing code I would
>>> vote on using the Yetus annotations while for the new code I think the
>>> internal packages is a better choice. What do you think?
>>> 
>>> For the internal packages it is another question how exactly we would like
>>> to implement it. For example, I am adding some new classes related to
>>> column indexes into the package org.apache.parquet.column.columnindex.
>>> All of these classes are for internal use only. Where should I put
>>> “internal”? I would vote on using 
>>> org.apache.parquet.internal.column.columnindex
>>> so later all modules would have the org.apache.parquet.internal structure
>>> for internal use and all the others are public. What do you think?
>>> 
>>> Thanks a lot,
>>> Gabor
>> 
>> 
>> 
>> 
>> -- 
>> Ryan Blue
>> Software Engineer
>> Netflix
>

Re: parquet-mr exposing public API

Reply via email to