Thanks. After reading the discussion in HUDI-561, I realized that the built-in partition transformer mentioned earlier is better suited to a custom key generator. Hopefully other suitable ideas for built-in transformers will come up later.
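To make the custom-key-generator direction concrete, here is a minimal sketch of an hourly key generator that normalizes the `create_ts` field (ISO-8601 string or unix-seconds timestamp) into a `yyyy/MM/dd/HH` partition path, matching the examples quoted below. It assumes the KeyGenerator API of the time (a constructor taking TypedProperties and a single getKey(GenericRecord) returning HoodieKey); the class name, import paths, and config keys are illustrative and may differ across Hudi versions.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

import org.apache.avro.generic.GenericRecord;
import org.apache.hudi.common.model.HoodieKey;
import org.apache.hudi.common.util.TypedProperties;
import org.apache.hudi.keygen.KeyGenerator;

/**
 * Illustrative sketch only: maps an ISO-8601 string or a unix timestamp
 * (seconds) in the configured partition path field to an hourly
 * yyyy/MM/dd/HH partition path, e.g.
 *   2020-02-23T22:41:42.123456789Z -> 2020/02/23/22
 *   1582497702.123456789           -> 2020/02/23/22
 */
public class HourlyPartitionKeyGenerator extends KeyGenerator {

  private static final DateTimeFormatter HOURLY =
      DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

  private final String recordKeyField;
  private final String partitionPathField; // e.g. "create_ts"

  public HourlyPartitionKeyGenerator(TypedProperties props) {
    super(props);
    // Config key names here are assumptions for illustration.
    this.recordKeyField = props.getString("hoodie.datasource.write.recordkey.field");
    this.partitionPathField = props.getString("hoodie.datasource.write.partitionpath.field");
  }

  @Override
  public HoodieKey getKey(GenericRecord record) {
    String recordKey = String.valueOf(record.get(recordKeyField));
    String raw = String.valueOf(record.get(partitionPathField));
    // Numeric values are treated as unix seconds; everything else as ISO-8601.
    Instant ts = raw.matches("\\d+(\\.\\d+)?")
        ? Instant.ofEpochSecond((long) Double.parseDouble(raw))
        : Instant.parse(raw);
    return new HoodieKey(recordKey, HOURLY.format(ts));
  }
}
```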
On Sun, Feb 23, 2020 at 6:34 PM vino yang <[email protected]> wrote:

> Hi Shiyan,
>
> Really sorry, I forgot to attach the reference; the relevant Jira ID is HUDI-561: https://issues.apache.org/jira/browse/HUDI-561
>
> It seems both of you faced the same issue, though the solutions are not the same. Never mind, you can move the discussion to that issue.
>
> Best,
> Vino
>
> Shiyan Xu <[email protected]> wrote on Mon, Feb 24, 2020 at 10:21 AM:
>
> > Thanks Vino. Are you referring to HUDI-613? How about making it an umbrella task due to its big scope? (BTW, it is stated as a "bug", which should be fixed too.) I can create another specific task under it for the idea of a datetime -> partition path transformer, if that makes sense.
> >
> > On Sun, Feb 23, 2020 at 5:57 PM vino yang <[email protected]> wrote:
> >
> > > Hi Shiyan,
> > >
> > > Thanks for raising this thread up again and sharing your thoughts. They are valuable.
> > >
> > > Regarding the date-time specific transform, there is an issue[1] that describes this business requirement.
> > >
> > > Best,
> > > Vino
> > >
> > > Shiyan Xu <[email protected]> wrote on Mon, Feb 24, 2020 at 7:22 AM:
> > >
> > > > Late to the party. :P
> > > >
> > > > I really favor the idea of enriching the built-in support. It is a very common case to use datetime fields for the partition path. We could have built-in support to normalize ISO format / unix timestamps. For example, an `HourlyPartitionTransformer` would normalize whatever field the user specified as the partition path. Say the user sets `create_ts` as the partition path field; the transformer would apply the change create_ts => _hoodie_partition_path:
> > > >
> > > > - 2020-02-23T22:41:42.123456789Z => 2020/02/23/22
> > > > - 1582497702.123456789 => 2020/02/23/22
> > > >
> > > > Does that make sense? If so, I may file a Jira for this.
> > > >
> > > > As for FilterTransformer or FlatMapTransformer, which are designed for generic purposes, they seem to belong to Spark or Flink's realm. You can do these two transformations with the Spark Dataset now, or, once decoupled from Spark, you'll probably have an abstract Dataset class to perform engine-agnostic transformations.
> > > >
> > > > My understanding of the transformer in Hudi is that it is more specifically purposed, where the underlying transformation is handled by the actual processing engine (Spark or Flink).
> > > >
> > > > On Tue, Feb 18, 2020 at 11:00 AM Vinoth Chandar <[email protected]> wrote:
> > > >
> > > > > Thanks Hamid and Vinoyang for the great discussion.
> > > > >
> > > > > On Fri, Feb 14, 2020 at 5:18 AM vino yang <[email protected]> wrote:
> > > > >
> > > > > > I have filed a Jira issue[1] to track this work.
> > > > > >
> > > > > > [1]: https://issues.apache.org/jira/browse/HUDI-613
> > > > > >
> > > > > > vino yang <[email protected]> wrote on Thu, Feb 13, 2020 at 9:51 PM:
> > > > > >
> > > > > > > Hi hamid,
> > > > > > >
> > > > > > > Agree with your opinion.
> > > > > > >
> > > > > > > Let's move forward step by step.
> > > > > > >
> > > > > > > Will file an issue to track the refactoring of the Transformer.
> > > > > > >
> > > > > > > Best,
> > > > > > > Vino
> > > > > > >
> > > > > > > hamid pirahesh <[email protected]> wrote on Thu, Feb 13, 2020 at 6:38 PM:
> > > > > > >
> > > > > > >> I think it is a good idea to decouple the transformer from Spark so that it can be used with other flow engines.
> > > > > > >> Once you do that, it is worth considering a much bigger play rather than another incremental play.
> > > > > > >> Given the scale of Hudi, we need to look at Airflow, particularly in the context of what Google is doing with Composer, addressing autoscaling, scheduling, monitoring, etc.
> > > > > > >> You need all of that to manage a serious ETL/ELT flow.
> > > > > > >>
> > > > > > >> On Thu, Feb 6, 2020 at 8:25 PM vino yang <[email protected]> wrote:
> > > > > > >>
> > > > > > >> > Currently, Hudi has a component that has not been widely used: Transformer. As we all know, before the original data lands in the data lake, a very common operation is data preprocessing and ETL. This is also the most common use scenario of many computing engines, such as Flink and Spark. Now that Hudi has taken advantage of the power of the computing engine, it can also naturally take advantage of its data preprocessing ability. We can refactor the Transformer to make it more flexible. To summarize, we can refactor along the following aspects:
> > > > > > >> >
> > > > > > >> > - Decouple Transformer from Spark
> > > > > > >> > - Enrich the Transformer and provide built-in transformers
> > > > > > >> > - Support Transformer chains
> > > > > > >> >
> > > > > > >> > For the first point, the Transformer interface is tightly coupled with Spark in its design, and it contains a Spark-specific context. This makes it impossible for us to take advantage of the transform capabilities provided by other engines (such as Flink) after supporting multiple engines. Therefore, we need to decouple it from Spark in the design.
> > > > > > >> >
> > > > > > >> > For the second point, we can enhance the Transformer and provide some out-of-the-box Transformers, such as FilterTransformer, FlatMapTransformer, and so on.
> > > > > > >> >
> > > > > > >> > For the third point, the most common pattern for data processing is the pipeline model, and the common implementation of the pipeline model is the responsibility-chain model, comparable to Apache Commons Chain[1]. Combining multiple Transformers can make data processing more flexible and extensible.
> > > > > > >> >
> > > > > > >> > If we enhance the capabilities of the Transformer component, Hudi will provide richer data processing capabilities based on the computing engine.
> > > > > > >> >
> > > > > > >> > What do you think?
> > > > > > >> >
> > > > > > >> > Any opinions and feedback are welcome and appreciated.
> > > > > > >> >
> > > > > > >> > Best,
> > > > > > >> > Vino
> > > > > > >> >
> > > > > > >> > [1]: https://commons.apache.org/proper/commons-chain/
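As a rough illustration of the Transformer refactoring discussed in the quoted thread (decoupling from Spark, built-in transformers, and a transformer chain), the sketch below uses entirely hypothetical type and method names, not the current Hudi API; `Dataset` stands in for whatever engine-neutral abstraction a Spark or Flink backend would implement.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/**
 * Hypothetical engine-agnostic dataset handle. A Spark implementation
 * might wrap Dataset<Row>, a Flink implementation a DataStream.
 */
interface EngineDataset<T> {
  EngineDataset<T> filter(Predicate<T> predicate);
  <R> EngineDataset<R> map(Function<T, R> fn);
}

/** Engine-agnostic transformer: takes a dataset, returns a transformed one. */
interface Transformer<T> {
  EngineDataset<T> apply(EngineDataset<T> input);
}

/** Example built-in transformer: keeps only records matching a predicate. */
class FilterTransformer<T> implements Transformer<T> {
  private final Predicate<T> predicate;

  FilterTransformer(Predicate<T> predicate) {
    this.predicate = predicate;
  }

  @Override
  public EngineDataset<T> apply(EngineDataset<T> input) {
    return input.filter(predicate);
  }
}

/** Runs transformers in order, pipeline / chain-of-responsibility style. */
class TransformerChain<T> implements Transformer<T> {
  private final List<Transformer<T>> stages;

  TransformerChain(List<Transformer<T>> stages) {
    this.stages = stages;
  }

  @Override
  public EngineDataset<T> apply(EngineDataset<T> input) {
    EngineDataset<T> current = input;
    for (Transformer<T> stage : stages) {
      current = stage.apply(current);
    }
    return current;
  }
}
```

In this sketch, only the `EngineDataset` implementations know about Spark or Flink, so the same chain of transformers could run on either engine, which mirrors the responsibility-chain idea referenced via Apache Commons Chain above.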
