Hi Kabeer,

You are right. HUDI-1 alone won't be sufficient. We need to do a follow-on.
IIRC this is already planned work (Balaji?). Filed
https://issues.apache.org/jira/browse/HUDI-80 to separate this from HUDI-1.

On to the issue you are facing: it seems like the connections to S3 keep
hanging around? I don't think cleaning actually opens any files; it simply
lists and deletes. We could call fs.close(), which would probably shut the
connections down, but we need to think through that more, since FileSystem
caching is a tricky issue. I filed
https://issues.apache.org/jira/browse/HUDI-81 separately to track this. If
you can help me track the connections to S3 etc., I can take a stab and
maybe we can test the patch in your environment?
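
To make that concrete, here is a rough, untested sketch of one way around
the caching problem. Everything named below (basePath, hadoopConf, the
method itself) is just a placeholder, not actual Hudi code: the idea is to
give the cleaner its own uncached FileSystem via FileSystem.newInstance()
and close only that, so the shared cached instance stays untouched.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Rough sketch only -- names are placeholders, not actual Hudi code.
void cleanWithPrivateFs(String basePath, Configuration hadoopConf) throws IOException {
  // newInstance() bypasses the JVM-wide FileSystem cache, so closing this
  // instance releases its S3 connections without breaking other code that
  // still holds the cached FileSystem for the same scheme/authority.
  FileSystem cleanerFs = FileSystem.newInstance(new Path(basePath).toUri(), hadoopConf);
  try {
    // list partitions and delete files, exactly as the cleaner does today
  } finally {
    cleanerFs.close();
  }
}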

We can work on the ticket. Please share your JIRA id, so I can add you as
a contributor, giving you commenting access etc. on JIRA.

Thanks
Vinoth



On Sun, Mar 24, 2019 at 2:11 PM Kabeer Ahmed <[email protected]> wrote:

> Hi Vinoth,
>
> Thank you for your response. I thought of reducing cleaner parallelism,
> which is Min(200, table_partitions). But it wouldn't have an effect:
> regardless of parallelism, there will be an attempt to scan all files
> (reduced parallelism might merely slow the process down).
> So, as stated, in a table with 750+ partitions I did notice that connections
> would increase, and I have now been forced to keep the S3 connection limit
> at 5k due to this issue.
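>
> (For reference, and from memory, so please double-check the exact name:
> the S3A pool limit I had to raise is the Hadoop property
> fs.s3a.connection.maximum, which can be set on the job's Hadoop
> configuration, e.g.:)
>
> // Hypothetical snippet; 5000 mirrors the 5k limit mentioned above.
> org.apache.hadoop.conf.Configuration conf = jsc.hadoopConfiguration();
> conf.setInt("fs.s3a.connection.maximum", 5000);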
> I also looked into the brief description of the jira:
> https://issues.apache.org/jira/browse/HUDI-1.
> This is a very nice optimisation to have, but I don't think it will help
> alleviate the concerns on S3. On HDFS, this jira will definitely help
> reduce the # of name node connections, but S3 objects will need to be
> opened to clear them and the problem will not go away.
> I think the effective work has to be along the lines of how the partitions
> are cleaned up in the routine below:
>
> // File: HoodieCopyOnWriteTable.java
> public List<HoodieCleanStat> clean(JavaSparkContext jsc) {
>   try {
>     FileSystem fs = getMetaClient().getFs();
>     List<String> partitionsToClean = FSUtils
>         .getAllPartitionPaths(fs, getMetaClient().getBasePath(),
>             config.shouldAssumeDatePartitioning());
>     logger.info("Partitions to clean up : " + partitionsToClean + ", with policy "
>         + config.getCleanerPolicy());
>     if (partitionsToClean.isEmpty()) {
>       logger.info("Nothing to clean here mom. It is already clean");
>       return Collections.emptyList();
>     }
>     return cleanPartitionPaths(partitionsToClean, jsc);
>   } catch (IOException e) {
>     throw new HoodieIOException("Failed to clean up after commit", e);
>   }
> }
> In the above routine, none of the connections that are opened get closed.
> I think the work should be along the lines of closing the connections in
> this routine after the cleaning operation (i.e. file close logic added so
> that it is executed in parallel for every file opened by the Spark
> executors); a rough sketch of what I mean is below.
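>
> As a rough, untested sketch (keeping the surrounding fields and helpers as
> they are, and assuming it is actually safe to close the FileSystem returned
> by getMetaClient().getFs() at this point -- this is a hypothetical change,
> not a patch):
>
> // File: HoodieCopyOnWriteTable.java
> public List<HoodieCleanStat> clean(JavaSparkContext jsc) {
>   FileSystem fs = null;
>   try {
>     fs = getMetaClient().getFs();
>     List<String> partitionsToClean = FSUtils
>         .getAllPartitionPaths(fs, getMetaClient().getBasePath(),
>             config.shouldAssumeDatePartitioning());
>     // (logging omitted for brevity)
>     if (partitionsToClean.isEmpty()) {
>       return Collections.emptyList();
>     }
>     return cleanPartitionPaths(partitionsToClean, jsc);
>   } catch (IOException e) {
>     throw new HoodieIOException("Failed to clean up after commit", e);
>   } finally {
>     // Release the underlying S3 connections once cleaning is done. Since
>     // getFs() may hand back a shared/cached instance, closing it here
>     // could affect other callers -- that is exactly what would need to be
>     // validated before doing this for real.
>     if (fs != null) {
>       try {
>         fs.close();
>       } catch (IOException e) {
>         logger.warn("Failed to close filesystem after cleaning", e);
>       }
>     }
>   }
> }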
>
> Please feel free to correct me if you think I have goofed up somewhere.
> Thanks
> Kabeer.
>
> PS: There is so much going on and there is a need to progress with the
> stuff at hand at work. Otherwise I would have loved to spend time and send
> a PR.
> On Mar 24 2019, at 7:04 am, Vinoth Chandar <[email protected]> wrote:
> > Hi Kabeer,
> >
> > No need to apologize :)
> > Mailing list works a lot better for reporting issues. We can respond much
> > quicker, since it's not buried with all the other GitHub events.
> >
> > On what you saw, the cleaner does list all partitions currently. Have you
> > tried reducing cleaner parallelism if limiting connections is your goal?
> >
> > Also, some good news is, once
> > https://issues.apache.org/jira/browse/HUDI-1 is landed (currently being
> > reviewed), a follow-on is to rework the cleaner incrementally on top,
> > which should help a lot here.
> >
> >
> > Thanks
> > Vinoth
> >
> > On Sat, Mar 23, 2019 at 7:39 PM Kabeer Ahmed <[email protected]>
> wrote:
> > > Hi,
> > > I have just raised this issue and thought to share it with the community
> > > in case someone else is experiencing it. Apologies in advance if this is
> > > a redundant email.
> > > Thanks
> > > Kabeer.
> >
> >
>
>
