Re: parquet-mr exposing public API

Zoltan Ivanfi Tue, 15 May 2018 06:41:11 -0700

Hi,

I think we should formalize this shortly for discussion. Based on your
approach, I would suggest the following rules:


_New methods/classes for internal use:_

1. When possible, new internal methods/classes must have private or
package private visibility.
2. When #1 is not possible, new internal methods/classes should be
placed in an internal package under org.apache.parquet.internal.**
3. When #2 would require a major refactoring or result in a breaking
change, new internal methods must be annotated with @Private instead.

_New methods/classes for 3rd party use:_

New public methods/classes outside of internal packages should be
annotated with @Public. This serves the following purposes:
  - It differentiates public methods/classes that were added using
this policy (and therefore are truly intended for 3rd party use) from
public methods added earlier (that may or may not be intended for 3rd
party use).
  - It will make it possible to check compliance to this policy in
code reviews by making it clear that the intended visibility of a
newly added public method/class was at least considered.
  - It will ease refactoring for Parquet 2.0 (at which point the
annotations can be removed).

Please let me know your opinions.

Thanks,

Zoltan

On Tue, May 15, 2018 at 1:29 PM Gabor Szadovszky
<[email protected]> wrote:
>
> Hi Everyone,
>
> I’ve made a change on top of my development related to the column indexes: 
> https://github.com/apache/parquet-mr/pull/456/commits/77123ff69cdd78f449f1b2996f37618a9c66e256
>  
> <https://github.com/apache/parquet-mr/pull/456/commits/77123ff69cdd78f449f1b2996f37618a9c66e256>
> Could you please check and comment about my proposal?
>
> Cheers,
> Gabor
>
> > On 11 May 2018, at 07:12, Gabor Szadovszky <[email protected]> 
> > wrote:
> >
> > Hi,
> >
> > I completely agree on using private and package-protected visibility 
> > wherever it is possible. Let me explain an actual example where it is not 
> > possible.
> > For the column index implementation I’ve created interfaces and builders 
> > for ColumnIndex and OffsetIndex. I’ve put them to the module 
> > parquet-column. These classes have to be visible for the metadata converter 
> > in parquet-hadoop and e.g. for the CLI to make it visible for debugging 
> > purposes. They also have to be used by ParquetFileWriter and Reader as well 
> > for filtering. So, they have to be visible in many different places of the 
> > code base. However, I would not expose them to the user as they work 
> > automatically (just like the classes around statistics).
> >
> > Some words about the Yetus InterfaceAudience annotations. I don’t have any 
> > experience on using these. After a quick search it seems we can skip the 
> > javadoc generation for “private” classes. I don’t know if other interesting 
> > tooling is available but it certainly has some benefits comparing to simple 
> > comments.
> > To summarize the concept of the annotations:
> > Public: public to the user; As we already exposed everything, I think, this 
> > is the default. However, other projects state that no annotation means 
> > private so I am not sure.
> > LimitiedPrivate: only public to specified projects (e.g. Hadoop); I don’t 
> > think we need it
> > Private: for internal use only; If we would like to categorize existing 
> > code this one is the most useful for us.
> > In addition, these annotations can be also used on methods not only on 
> > classes, so we can mark “private” on the current API without any refactor.
> >
> > My proposal is to use the internal packaging for new implementations (only 
> > in cases where it is really necessary). It would also help restructuring 
> > the API for 2.0. In addition, we can use the Yetus InterfaceAudience 
> > annotations for the existing API as a separate effort. It would help to see 
> > what parts of the API should and should not be used by the user. 
> > (Currently, it makes really hard to extend/change the API in feature 
> > developments.) If we start using both ways it would be a good idea to also 
> > annotate the *.internal*.classes with @Private for consistency.
> >
> > If we agree on using the internal packaging it would be great to also agree 
> > on the pattern. My proposal is to use org.apache.parquet.internal.** 
> > (org.apache.parquet.internal.column.columnindex for my previous example).
> >
> > What do you guys think?
> >
> > Cheers,
> > Gabor
> >
> >
> >> On 10 May 2018, at 21:39, Ryan Blue <[email protected]> wrote:
> >>
> >> I think we should first try to use private and package-protected visibility
> >> in Java first, and only use annotations or internal packages if we need to
> >> expose internals across packages for some reason.
> >>
> >> For the yetus annotations, what is the benefit of using these? Is there
> >> some integration with javadoc tools? What are the annotations that you're
> >> proposing to use and what do they mean?
> >>
> >> rb
> >>
> >> On Thu, May 10, 2018 at 6:44 AM, Gabor Szadovszky <
> >> [email protected]> wrote:
> >>
> >>> Hi Everyone,
> >>>
> >>> Unfortunately, I was not able to participate on the last Parquet sync
> >>> where you discussed how to separate the java public code parts that are
> >>> public for internal use only or public to be exposed to the user. I am
> >>> currently adding some new code related to the column indexes and would 
> >>> like
> >>> to ask about your opinions.
> >>> Use the Yetus InterfaceAudience annotations to mark the public/private
> >>> classes
> >>> pros:
> >>> backward compatible (if we want to categorise the already existing API, it
> >>> can be done by using this)
> >>> already used by some of the Hadoop components
> >>> cons:
> >>> not sure if it is much better than a simple comment; don’t know any
> >>> tooling that checks this
> >>> even if we mark some existing API private we cannot remove/modify them
> >>> otherwise we break the compatibility
> >>> Use internal package naming convention for the new classes
> >>> pros:
> >>> obvious to the user
> >>> future proof (if we would like to use the java9 module feature in
> >>> parquet-mr 2.x)
> >>> cons:
> >>> backward incompatible (we are not able to categorise the already existing
> >>> API)
> >>> can be ignored; I don’t know any way for java8 to enforce it
> >>>
> >>> If we would like to mark the internal API for the existing code I would
> >>> vote on using the Yetus annotations while for the new code I think the
> >>> internal packages is a better choice. What do you think?
> >>>
> >>> For the internal packages it is another question how exactly we would like
> >>> to implement it. For example, I am adding some new classes related to
> >>> column indexes into the package org.apache.parquet.column.columnindex.
> >>> All of these classes are for internal use only. Where should I put
> >>> “internal”? I would vote on using 
> >>> org.apache.parquet.internal.column.columnindex
> >>> so later all modules would have the org.apache.parquet.internal structure
> >>> for internal use and all the others are public. What do you think?
> >>>
> >>> Thanks a lot,
> >>> Gabor
> >>
> >>
> >>
> >>
> >> --
> >> Ryan Blue
> >> Software Engineer
> >> Netflix
> >
>

Re: parquet-mr exposing public API

Reply via email to