Thanks for the explanation, Vinoth. The code was the same as listed in
https://github.com/apache/incubator-hudi/issues/639, with the table
type set to `.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY,
DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)`.
The resulting data was stored on AWS S3.
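For reference, here is a minimal sketch of the write path (in the style of
issue 639; the table name, key/partition/precombine fields, and the S3 path
are placeholders I made up, and `inputDF` is assumed to be in scope):

    import com.uber.hoodie.DataSourceWriteOptions
    import org.apache.spark.sql.SaveMode

    // Sketch: upsert a DataFrame into a MERGE_ON_READ Hudi table on S3.
    // Field names, table name, and path are hypothetical placeholders.
    inputDF.write
      .format("com.uber.hoodie")
      .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY,
        DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
      .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "id")
      .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "dt")
      .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ts")
      .option("hoodie.table.name", "my_table")
      .mode(SaveMode.Append)
      .save("s3://my-bucket/hudi/my_table")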
I will experiment more with
`spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
classOf[org.apache.hadoop.fs.PathFilter]);`. Judging from the behavior,
the config may not have taken effect.
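For completeness, this is roughly the spark-shell read path I am testing,
per http://hudi.apache.org/querying_data.html#spark (the S3 path and the
`id` record key column below are placeholders):

    import com.uber.hoodie.hadoop.HoodieROTablePathFilter
    import org.apache.hadoop.fs.PathFilter

    // Register the path filter so Spark only picks up parquet files that
    // belong to a completed commit; without it, files written by inflight
    // commits can surface as duplicates.
    spark.sparkContext.hadoopConfiguration.setClass(
      "mapreduce.input.pathFilter.class",
      classOf[HoodieROTablePathFilter],
      classOf[PathFilter])

    // Placeholder path: read one partition and check for duplicate keys.
    val df = spark.read.parquet("s3://my-bucket/hudi/my_table/dt=2019-04-23/*")
    df.groupBy("id").count().filter("count > 1").show()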

On Sat, Apr 27, 2019 at 12:09 AM Vinoth Chandar <[email protected]> wrote:

> Hi,
>
> >> The duplicates were found in inflight commit parquet files. Wondering if
> >> this was expected?
> Spark shell should not even be reading in-flight parquet files. Can you
> double check if the spark access is properly configured?
> http://hudi.apache.org/querying_data.html#spark
>
> Inflight should be rolled back at the start of the next commit/delta
> commit.. Not sure why there are so many inflight delta commits.
> If you can give a reproducible case, happy to debug it more..
>
> Only complete instants are archived.. So yes, inflight is not archived..
>
> Hope that helps
>
> Thanks
> Vinoth
>
> On Fri, Apr 26, 2019 at 2:09 AM Jun Zhu <[email protected]>
> wrote:
>
> > Hi Vinoth,
> > Some follow-up questions on this thread.
> > Here is what I found after running for a few days:
> > In the .hoodie folder, perhaps due to the retention policy, there is an
> > obvious dividing line (see the listing at the end of this email). Before
> > that line, cleaned commits were archived, and I found duplicates when
> > querying the partition corresponding to an inflight commit via
> > spark-shell. After the line, everything behaves normally and global
> > dedup works.
> > The duplicates were found in inflight commit parquet files. Wondering if
> > this was expected?
> > Q:
> > 1. An inflight commit should be rolled back by the next write. Is it
> > normal that so many inflight commits did not make it? Or can I configure
> > a retention policy that rolls back inflight commits some other way?
> > 2. Does the commit retention policy not archive inflight commits?
> >
> > 2019-04-23 20:23:47        378 20190423122339.deltacommit.inflight
> >
> > 2019-04-23 20:43:53        378 20190423124343.deltacommit.inflight
> >
> > 2019-04-23 22:14:04        378 20190423141354.deltacommit.inflight
> >
> > 2019-04-23 22:44:09        378 20190423144400.deltacommit.inflight
> >
> > 2019-04-23 22:54:18        378 20190423145408.deltacommit.inflight
> >
> > 2019-04-23 23:04:09        378 20190423150400.deltacommit.inflight
> >
> > 2019-04-23 23:24:30        378 20190423152421.deltacommit.inflight
> >
> > *2019-04-23 23:44:34        378 20190423154424.deltacommit.inflight*
> >
> > *2019-04-24 00:15:46       2991 20190423161431.clean*
> >
> > 2019-04-24 00:15:21     870536 20190423161431.deltacommit
> >
> > 2019-04-24 00:25:19       2991 20190423162424.clean
> >
> > 2019-04-24 00:25:09     875825 20190423162424.deltacommit
> >
> > 2019-04-24 00:35:26       2991 20190423163429.clean
> >
> > 2019-04-24 00:35:18     881925 20190423163429.deltacommit
> >
> > 2019-04-24 00:46:14       2991 20190423164428.clean
> >
> > 2019-04-24 00:45:44     888025 20190423164428.deltacommit
> >
> > Thanks,
> > Jun
> >
> > On 2019/04/18 14:29:23, Vinoth Chandar <[email protected]> wrote:
> > > Hi Jun,
> > >
> > > Responses below.
> > >
> > > >> 1. Some inflight files may never reach a commit?
> > > Yes. The next write attempt will first issue a rollback to clean up
> > > such partial/leftover files, before it begins the new commit.
> > >
> > > >> 2. If an inflight commit and the parquet files it generated still
> > > >> exist, will global dedup ignore such files?
> > > Even if they are not rolled back, we check the inflight parquet files
> > > against the committed timeline, which they won't be a part of. So it
> > > should be safe.
> > >
> > > >> 3. If an inflight commit and the parquet files it generated still
> > > >> exist, is the correct query result determined by the read config
> > > >> (I mean mapreduce.input.pathFilter.class in Spark SQL)?
> > > Yes, the filtering should work as well. It's the same technique used
> > > by the writer.
> > >
> > > >> 4. Is there any way we can use
> > > >> spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
> > > >>   classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
> > > >>   classOf[org.apache.hadoop.fs.PathFilter]);
> > > >> in the Spark Thrift Server when starting it?
> > >
> > > I am not familiar with the Spark Thrift Server myself. Any pointers on
> > > where I can learn more?
> > > Two suggestions:
> > > - You can check whether adding this to the Hadoop configuration XML
> > > files gets it picked up by Spark.
> > > - Alternatively, you can set the Spark config mentioned here:
> > > http://hudi.apache.org/querying_data.html#spark-rt-view (works for the
> > > RO view also), which I assume should be doable on the thrift server.
> > >
> > > Thanks
> > > Vinoth
> > >
> > >
> > > On Wed, Apr 17, 2019 at 12:08 AM Jun Zhu <[email protected]>
> > > wrote:
> > >
> > > > Hi,
> > > > Link: https://github.com/apache/incubator-hudi/issues/639
> > > > Sorry, I failed to open
> > > > https://lists.apache.org/[email protected] .
> > > > I have some follow-up questions for issue 639:
> > > >
> > > > > So, the sequence of events is: we write parquet files, and then
> > > > > upon successful writing of all attempted parquet files, we
> > > > > actually mark the commit as completed (i.e. not inflight anymore).
> > > > > So this is normal. This is done to prevent queries from reading
> > > > > partially written parquet files.
> > > >
> > > > Does that mean:
> > > > 1. Some inflight files may never reach a commit?
> > > > 2. If an inflight commit and the parquet files it generated still
> > > > exist, will global dedup ignore such files?
> > > > 3. If an inflight commit and the parquet files it generated still
> > > > exist, is the correct query result determined by the read config
> > > > (I mean mapreduce.input.pathFilter.class in Spark SQL)?
> > > > 4. Is there any way we can use
> > > > spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
> > > >   classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
> > > >   classOf[org.apache.hadoop.fs.PathFilter]);
> > > > in the Spark Thrift Server when starting it?
> > > >
> > > > Best,
> > > > --
> > > > Jun Zhu
> > > > Sr. Engineer I, Data
> > >
> >
>


-- 
[image: vshapesaqua11553186012.gif] <https://vungle.com/>   *Jun Zhu*
Sr. Engineer I, Data
+86 18565739171

[image: in1552694272.png] <https://www.linkedin.com/company/vungle>    [image:
fb1552694203.png] <https://facebook.com/vungle>      [image:
tw1552694330.png] <https://twitter.com/vungle>      [image:
ig1552694392.png] <https://www.instagram.com/vungle>
Units 3801, 3804, 38F, C Block, Beijing Yintai Center, Beijing, China
