Hi,

I completely agree on using private and package-protected visibility wherever 
it is possible. Let me explain an actual example where it is not possible.
For the column index implementation I’ve created interfaces and builders for 
ColumnIndex and OffsetIndex. I’ve put them to the module parquet-column. These 
classes have to be visible for the metadata converter in parquet-hadoop and 
e.g. for the CLI to make it visible for debugging purposes. They also have to 
be used by ParquetFileWriter and Reader as well for filtering. So, they have to 
be visible in many different places of the code base. However, I would not 
expose them to the user as they work automatically (just like the classes 
around statistics).

Some words about the Yetus InterfaceAudience annotations. I don’t have any 
experience on using these. After a quick search it seems we can skip the 
javadoc generation for “private” classes. I don’t know if other interesting 
tooling is available but it certainly has some benefits comparing to simple 
comments.
To summarize the concept of the annotations:
Public: public to the user; As we already exposed everything, I think, this is 
the default. However, other projects state that no annotation means private so 
I am not sure.
LimitiedPrivate: only public to specified projects (e.g. Hadoop); I don’t think 
we need it
Private: for internal use only; If we would like to categorize existing code 
this one is the most useful for us.
In addition, these annotations can be also used on methods not only on classes, 
so we can mark “private” on the current API without any refactor.

My proposal is to use the internal packaging for new implementations (only in 
cases where it is really necessary). It would also help restructuring the API 
for 2.0. In addition, we can use the Yetus InterfaceAudience annotations for 
the existing API as a separate effort. It would help to see what parts of the 
API should and should not be used by the user. (Currently, it makes really hard 
to extend/change the API in feature developments.) If we start using both ways 
it would be a good idea to also annotate the *.internal*.classes with @Private 
for consistency.

If we agree on using the internal packaging it would be great to also agree on 
the pattern. My proposal is to use org.apache.parquet.internal.** 
(org.apache.parquet.internal.column.columnindex for my previous example).

What do you guys think?

Cheers,
Gabor


> On 10 May 2018, at 21:39, Ryan Blue <[email protected]> wrote:
> 
> I think we should first try to use private and package-protected visibility
> in Java first, and only use annotations or internal packages if we need to
> expose internals across packages for some reason.
> 
> For the yetus annotations, what is the benefit of using these? Is there
> some integration with javadoc tools? What are the annotations that you're
> proposing to use and what do they mean?
> 
> rb
> 
> On Thu, May 10, 2018 at 6:44 AM, Gabor Szadovszky <
> [email protected]> wrote:
> 
>> Hi Everyone,
>> 
>> Unfortunately, I was not able to participate on the last Parquet sync
>> where you discussed how to separate the java public code parts that are
>> public for internal use only or public to be exposed to the user. I am
>> currently adding some new code related to the column indexes and would like
>> to ask about your opinions.
>> Use the Yetus InterfaceAudience annotations to mark the public/private
>> classes
>> pros:
>> backward compatible (if we want to categorise the already existing API, it
>> can be done by using this)
>> already used by some of the Hadoop components
>> cons:
>> not sure if it is much better than a simple comment; don’t know any
>> tooling that checks this
>> even if we mark some existing API private we cannot remove/modify them
>> otherwise we break the compatibility
>> Use internal package naming convention for the new classes
>> pros:
>> obvious to the user
>> future proof (if we would like to use the java9 module feature in
>> parquet-mr 2.x)
>> cons:
>> backward incompatible (we are not able to categorise the already existing
>> API)
>> can be ignored; I don’t know any way for java8 to enforce it
>> 
>> If we would like to mark the internal API for the existing code I would
>> vote on using the Yetus annotations while for the new code I think the
>> internal packages is a better choice. What do you think?
>> 
>> For the internal packages it is another question how exactly we would like
>> to implement it. For example, I am adding some new classes related to
>> column indexes into the package org.apache.parquet.column.columnindex.
>> All of these classes are for internal use only. Where should I put
>> “internal”? I would vote on using 
>> org.apache.parquet.internal.column.columnindex
>> so later all modules would have the org.apache.parquet.internal structure
>> for internal use and all the others are public. What do you think?
>> 
>> Thanks a lot,
>> Gabor
> 
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix

Reply via email to