+1. Incremental cleaning is scheduled work. I will be working on this immediately after HUDI-1.
Balaji.V

On Sunday, March 24, 2019, 7:42:03 PM PDT, Vinoth Chandar <[email protected]> wrote:

Hi Kabeer,
You are right, HUDI-1 alone won't be sufficient. We need to do a follow-on. IIRC this is already planned work (Balaji?). Filed https://issues.apache.org/jira/browse/HUDI-80 to separate this from HUDI-1.

On the issue you are facing: it seems the connections to S3 keep hanging around? I don't think cleaning actually opens any files; it simply lists and deletes. We could call fs.close(), which would probably shut the connections down. But we need to think through that more, since FileSystem caching is a tricky issue. Filed https://issues.apache.org/jira/browse/HUDI-81 separately to track this. If you can help me trace the connections to S3 etc., I can take a stab, and maybe we can test the patch in your environment? We can work on the ticket. Please share your JIRA id, so I can add you as a contributor, giving you commenting rights etc. on JIRA.

Thanks
Vinoth

On Sun, Mar 24, 2019 at 2:11 PM Kabeer Ahmed <[email protected]> wrote:
> Hi Vinoth,
>
> Thank you for your response. I thought of reducing cleaner parallelism,
> which is Min(200, table_partitions). But it wouldn't have an effect:
> regardless of parallelism, there will be an attempt to scan all files
> (reduced parallelism might, albeit, slow the process).
>
> As stated, in a table with 750+ partitions I did notice that connections
> would increase, and I have now been forced to keep the S3 connection limit
> at 5k due to this issue.
>
> I also looked into the brief description of the JIRA:
> https://issues.apache.org/jira/browse/HUDI-1. This is a very nice
> optimisation to have, but I don't think it will help alleviate the
> concerns on S3. On HDFS, this JIRA will definitely help reduce the number
> of name node connections, but S3 objects will need to be opened to clear
> them, and the problem will not go away.
> I think the effective work has to be along the lines of cleaning up the
> partitions in the routine below:
>
> // File: HoodieCopyOnWriteTable.java
> public List<HoodieCleanStat> clean(JavaSparkContext jsc) {
>   try {
>     FileSystem fs = getMetaClient().getFs();
>     List<String> partitionsToClean = FSUtils
>         .getAllPartitionPaths(fs, getMetaClient().getBasePath(),
>             config.shouldAssumeDatePartitioning());
>     logger.info("Partitions to clean up : " + partitionsToClean
>         + ", with policy " + config.getCleanerPolicy());
>     if (partitionsToClean.isEmpty()) {
>       logger.info("Nothing to clean here mom. It is already clean");
>       return Collections.emptyList();
>     }
>     return cleanPartitionPaths(partitionsToClean, jsc);
>   } catch (IOException e) {
>     throw new HoodieIOException("Failed to clean up after commit", e);
>   }
> }
>
> In the above routine, the connections that were opened are not closed. I
> think the work should be along the lines of closing the connections in
> this routine after the cleaning operation (i.e. file-close logic added so
> that it is executed in parallel for every file opened by the Spark
> executors).
>
> Please feel free to correct me if you think I have goofed up somewhere.
>
> Thanks
> Kabeer.
>
> PS: There is so much going on and there is a need to progress with the
> stuff at hand at work. Otherwise I would have loved to spend time and
> send a PR.
>
> On Mar 24 2019, at 7:04 am, Vinoth Chandar <[email protected]> wrote:
> > Hi Kabeer,
> >
> > No need to apologize :)
> > The mailing list works a lot better for reporting issues. We can respond
> > much quicker, since it's not buried with all the other GitHub events.
> >
> > On what you saw: the cleaner does list all partitions currently. Have
> > you tried reducing cleaner parallelism, if limiting connections is your
> > goal?
> >
> > Also, some good news: once https://issues.apache.org/jira/browse/HUDI-1
> > is landed (currently being reviewed), a follow-on is to rework the
> > cleaner incrementally on top, which should help a lot here.
> >
> > Thanks
> > Vinoth
> >
> > On Sat, Mar 23, 2019 at 7:39 PM Kabeer Ahmed <[email protected]> wrote:
> > > Hi,
> > >
> > > I have just raised this issue and thought to share with the community
> > > if someone else is experiencing this. Apologies in advance if this is
> > > a redundant email.
> > >
> > > Thanks
> > > Kabeer.
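[Editor's note] For context on Vinoth's remark that "fs caching is a tricky issue": Hadoop's FileSystem.get() returns a JVM-wide cached instance shared by every caller with the same scheme/authority/user, so calling close() on it can break unrelated components holding the same handle; FileSystem.newInstance() bypasses the cache and is safe to close. The sketch below is a toy stand-in in plain Java (not Hadoop code, class names are illustrative) that shows the hazard:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy stand-in for Hadoop's FileSystem cache, keyed here only by URI.
// (The real cache also keys on the current user; this is illustrative.)
class CachedFs {
    private static final Map<String, CachedFs> CACHE = new ConcurrentHashMap<>();
    private boolean closed = false;

    // Like FileSystem.get(): all callers for one URI share the SAME instance.
    static CachedFs get(String uri) {
        return CACHE.computeIfAbsent(uri, k -> new CachedFs());
    }

    // Like FileSystem.newInstance(): a private, uncached instance,
    // safe to close without affecting other callers.
    static CachedFs newInstance(String uri) {
        return new CachedFs();
    }

    void close() { closed = true; }
    boolean isClosed() { return closed; }
}

public class FsCacheDemo {
    public static void main(String[] args) {
        CachedFs shared = CachedFs.get("s3a://bucket");
        CachedFs other  = CachedFs.get("s3a://bucket");
        shared.close();
        // Both names point at the SAME cached object, so "other" is now dead:
        System.out.println(other.isClosed());   // true

        CachedFs priv = CachedFs.newInstance("s3a://other-bucket");
        priv.close();
        // The cache never handed out "priv", so get() callers are unaffected:
        System.out.println(CachedFs.get("s3a://other-bucket").isClosed()); // false
    }
}
```

This is why a blanket fs.close() inside the cleaner needs care: if the handle came from the shared cache, closing it punishes every other user of that filesystem in the same JVM.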

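[Editor's note] The "close what you opened" discipline Kabeer proposes for cleanPartitionPaths() could look roughly like the pattern below. This is a hedged sketch in plain Java, not Hudi code: Handle and cleanPartitions are hypothetical stand-ins for whatever per-partition resource each Spark executor would open, and try-with-resources does the releasing:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of closing per-partition resources as part of the clean operation.
public class CleanAndClose {
    // Hypothetical stand-in for a per-partition resource (e.g. a connection).
    static class Handle implements AutoCloseable {
        final String partition;
        Handle(String partition) { this.partition = partition; }
        String delete() { return "deleted:" + partition; }
        @Override public void close() { /* release the underlying resource */ }
    }

    static List<String> cleanPartitions(List<String> partitions) {
        List<String> stats = new ArrayList<>();
        for (String p : partitions) {
            // try-with-resources guarantees close() runs even if delete()
            // throws, so handles cannot pile up across a 750+ partition scan.
            try (Handle h = new Handle(p)) {
                stats.add(h.delete());
            }
        }
        return stats;
    }

    public static void main(String[] args) {
        System.out.println(cleanPartitions(List.of("2019/03/23", "2019/03/24")));
        // [deleted:2019/03/23, deleted:2019/03/24]
    }
}
```

In a real Spark cleaner the loop body would run inside a mapPartitions-style closure on the executors, but the shape is the same: scope each resource to the unit of work that needs it, rather than closing a shared FileSystem at the end.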