Hi Everyone, I’ve made a change on top of my development related to the column indexes: https://github.com/apache/parquet-mr/pull/456/commits/77123ff69cdd78f449f1b2996f37618a9c66e256 <https://github.com/apache/parquet-mr/pull/456/commits/77123ff69cdd78f449f1b2996f37618a9c66e256> Could you please check and comment about my proposal?
Cheers, Gabor > On 11 May 2018, at 07:12, Gabor Szadovszky <[email protected]> > wrote: > > Hi, > > I completely agree on using private and package-protected visibility wherever > it is possible. Let me explain an actual example where it is not possible. > For the column index implementation I’ve created interfaces and builders for > ColumnIndex and OffsetIndex. I’ve put them to the module parquet-column. > These classes have to be visible for the metadata converter in parquet-hadoop > and e.g. for the CLI to make it visible for debugging purposes. They also > have to be used by ParquetFileWriter and Reader as well for filtering. So, > they have to be visible in many different places of the code base. However, I > would not expose them to the user as they work automatically (just like the > classes around statistics). > > Some words about the Yetus InterfaceAudience annotations. I don’t have any > experience on using these. After a quick search it seems we can skip the > javadoc generation for “private” classes. I don’t know if other interesting > tooling is available but it certainly has some benefits comparing to simple > comments. > To summarize the concept of the annotations: > Public: public to the user; As we already exposed everything, I think, this > is the default. However, other projects state that no annotation means > private so I am not sure. > LimitiedPrivate: only public to specified projects (e.g. Hadoop); I don’t > think we need it > Private: for internal use only; If we would like to categorize existing code > this one is the most useful for us. > In addition, these annotations can be also used on methods not only on > classes, so we can mark “private” on the current API without any refactor. > > My proposal is to use the internal packaging for new implementations (only in > cases where it is really necessary). It would also help restructuring the API > for 2.0. In addition, we can use the Yetus InterfaceAudience annotations for > the existing API as a separate effort. It would help to see what parts of the > API should and should not be used by the user. (Currently, it makes really > hard to extend/change the API in feature developments.) If we start using > both ways it would be a good idea to also annotate the *.internal*.classes > with @Private for consistency. > > If we agree on using the internal packaging it would be great to also agree > on the pattern. My proposal is to use org.apache.parquet.internal.** > (org.apache.parquet.internal.column.columnindex for my previous example). > > What do you guys think? > > Cheers, > Gabor > > >> On 10 May 2018, at 21:39, Ryan Blue <[email protected]> wrote: >> >> I think we should first try to use private and package-protected visibility >> in Java first, and only use annotations or internal packages if we need to >> expose internals across packages for some reason. >> >> For the yetus annotations, what is the benefit of using these? Is there >> some integration with javadoc tools? What are the annotations that you're >> proposing to use and what do they mean? >> >> rb >> >> On Thu, May 10, 2018 at 6:44 AM, Gabor Szadovszky < >> [email protected]> wrote: >> >>> Hi Everyone, >>> >>> Unfortunately, I was not able to participate on the last Parquet sync >>> where you discussed how to separate the java public code parts that are >>> public for internal use only or public to be exposed to the user. I am >>> currently adding some new code related to the column indexes and would like >>> to ask about your opinions. >>> Use the Yetus InterfaceAudience annotations to mark the public/private >>> classes >>> pros: >>> backward compatible (if we want to categorise the already existing API, it >>> can be done by using this) >>> already used by some of the Hadoop components >>> cons: >>> not sure if it is much better than a simple comment; don’t know any >>> tooling that checks this >>> even if we mark some existing API private we cannot remove/modify them >>> otherwise we break the compatibility >>> Use internal package naming convention for the new classes >>> pros: >>> obvious to the user >>> future proof (if we would like to use the java9 module feature in >>> parquet-mr 2.x) >>> cons: >>> backward incompatible (we are not able to categorise the already existing >>> API) >>> can be ignored; I don’t know any way for java8 to enforce it >>> >>> If we would like to mark the internal API for the existing code I would >>> vote on using the Yetus annotations while for the new code I think the >>> internal packages is a better choice. What do you think? >>> >>> For the internal packages it is another question how exactly we would like >>> to implement it. For example, I am adding some new classes related to >>> column indexes into the package org.apache.parquet.column.columnindex. >>> All of these classes are for internal use only. Where should I put >>> “internal”? I would vote on using >>> org.apache.parquet.internal.column.columnindex >>> so later all modules would have the org.apache.parquet.internal structure >>> for internal use and all the others are public. What do you think? >>> >>> Thanks a lot, >>> Gabor >> >> >> >> >> -- >> Ryan Blue >> Software Engineer >> Netflix >
