fx19880617 opened a new pull request #4742: Adding bootstrap mode for 
Pinot-hadoop job to output segments into relative directories.
URL: https://github.com/apache/incubator-pinot/pull/4742
 
 
   - Skip hidden files and temporary files created by computation frameworks such as 
Hadoop and Spark.
   - Add a `job.bootstrap` flag that makes the output directory mirror each input 
file's path relative to the input directory.
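
The first item can be illustrated with a small filter. This is only a sketch: the description does not say which names count as hidden, so the code assumes the convention used by Hadoop's `FileInputFormat`, which ignores names starting with `_` or `.` (for example `_SUCCESS`). The class and method names are hypothetical and are not taken from this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not the code from this PR.
final class HiddenFileFilter {
  // Assumption: a file is hidden or temporary when its base name starts with '.' or '_'.
  static boolean isHidden(String path) {
    String name = path.substring(path.lastIndexOf('/') + 1);
    return name.startsWith(".") || name.startsWith("_");
  }

  // Keep only the paths whose base name is not hidden, preserving order.
  static List<String> dataFiles(List<String> paths) {
    List<String> result = new ArrayList<>();
    for (String p : paths) {
      if (!isHidden(p)) {
        result.add(p);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> listing = Arrays.asList(
        "/path/to/input/yyyy=2019/mm=10/dd=1/part-0-r-aaa.avro",
        "/path/to/input/yyyy=2019/mm=10/dd=1/_SUCCESS",
        "/path/to/input/yyyy=2019/mm=10/dd=1/.part-0-r-aaa.avro.crc");
    // Print the files that would be handed to segment creation.
    for (String p : dataFiles(listing)) {
      System.out.println(p);
    }
  }
}
```

Only the base name is checked here, so a hidden parent directory would not be filtered out; whether the job also skips those is not stated in the description.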
   
   **job.properties**
   ```
   input.dir = /path/to/input
   output.dir = /path/to/output
   job.bootstrap = true
   segment.table.name = mytable
   ```
   The directory layout under `/path/to/input` looks like:
   ```
   /path/to/input/yyyy=2019/mm=10/dd=1/part-0-r-aaa.avro
   /path/to/input/yyyy=2019/mm=10/dd=2/part-0-r-bbb.avro
   /path/to/input/yyyy=2019/mm=10/dd=3/part-0-r-ccc.avro
   ```
   
   We expect the output directory structure to be:
   ```
   /path/to/output/yyyy=2019/mm=10/dd=1/mytable_0.tar.gz
   /path/to/output/yyyy=2019/mm=10/dd=2/mytable_1.tar.gz
   /path/to/output/yyyy=2019/mm=10/dd=3/mytable_2.tar.gz
   ```
   The old job instead writes every segment directly into the output directory:
   ```
   /path/to/output/mytable_0.tar.gz
   /path/to/output/mytable_1.tar.gz
   /path/to/output/mytable_2.tar.gz
   ```
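
The path mapping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation in this PR: it uses `java.nio.file` instead of the Hadoop filesystem API so that it runs without extra dependencies, and `BootstrapOutputPath` and `outputDirFor` are hypothetical names.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not the code from this PR.
final class BootstrapOutputPath {
  // Returns the directory a segment built from inputFile should be written to.
  // With bootstrap disabled, every segment goes directly into outputDir.
  static Path outputDirFor(Path inputDir, Path outputDir, Path inputFile, boolean bootstrap) {
    if (!bootstrap) {
      return outputDir;
    }
    // Path of the input file relative to inputDir, without the file name.
    Path relativeParent = inputDir.relativize(inputFile).getParent();
    // getParent() is null when the file sits directly under inputDir.
    return relativeParent == null ? outputDir : outputDir.resolve(relativeParent);
  }

  public static void main(String[] args) {
    Path inputDir = Paths.get("/path/to/input");
    Path outputDir = Paths.get("/path/to/output");
    Path inputFile = Paths.get("/path/to/input/yyyy=2019/mm=10/dd=1/part-0-r-aaa.avro");
    // Compare the target directory with and without bootstrap mode.
    System.out.println(outputDirFor(inputDir, outputDir, inputFile, true));
    System.out.println(outputDirFor(inputDir, outputDir, inputFile, false));
  }
}
```

Segment file names such as `mytable_0.tar.gz` are left out of the sketch; it only covers how the target directory follows the input file's relative path.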
