didip commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1244769844

   @gianm Sure thing.
   
   All of our customers (data scientists & data engineers) use Spark to massage 
data for Druid to consume. They typically do a number of transformations and at 
the end they push raw Parquet files to a specific S3 bucket+path.
   
   Now, these people have varying proficiency in Spark. Some are extremely 
good, some are barely passable. Spark does a lot of things behind the scenes, 
and this usually involves creating `_temporary` folders, e.g. during merges or 
shuffles. These are the junk folders I talked about. 
   
   It's incredibly tedious to chase these down and remind my customers to clean 
up. Some are capable of cleaning up; some completely ignored my requests (but 
later complained when their count data didn't match because the Parquet files 
inside the `_temporary` folder were accidentally ingested).
   
   As for the confusion with `s3://mybucket/path/to/parquet`, I can provide a 
convenience method to strip them out, which should help reduce the confusion.
   
   As for the filename filter, I am not sure it's that useful if you have the 
full-path glob feature. It might introduce more confusion by having two 
different filter JSON attributes. 
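   To illustrate the idea, here is a minimal sketch (not the actual Druid implementation; the S3 key names are hypothetical) of how a single full-path glob could exclude Spark's `_temporary` junk at any depth:

   ```python
   from fnmatch import fnmatch

   # Hypothetical S3 keys: real Parquet output plus Spark's _temporary junk.
   keys = [
       "path/to/parquet/part-00000.snappy.parquet",
       "path/to/parquet/part-00001.snappy.parquet",
       "path/to/parquet/_temporary/0/task_0000/part-00000.snappy.parquet",
   ]

   # fnmatch is not path-aware, so "*" also matches "/" -- one pattern
   # catches a _temporary folder at any depth under the prefix.
   junk_pattern = "*/_temporary/*"
   clean = [k for k in keys if not fnmatch(k, junk_pattern)]

   for k in clean:
       print(k)
   ```

   An include-style glob such as `path/to/parquet/*.parquet` would also skip the junk, but an exclude pattern makes the intent explicit and survives customers nesting their output differently.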


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

