[GitHub] carbondata issue #2850: [WIP] Added concurrent reading through SDK

NamanRastogi Sat, 27 Oct 2018 04:04:57 -0700

Github user NamanRastogi commented on the issue:

    https://github.com/apache/carbondata/pull/2850
  
    Please check the `split` method: it splits the list of `CarbonRecordReader`s 
into multiple `CarbonReader`s. It does not jumble the order of the 
`CarbonRecordReader`s; they stay sequential.
    
    Suppose there are 10 *carbondata* files, and thus 10 `CarbonRecordReader`s in 
the `CarbonReader.readers` object, and the user wants 3 splits. The user will 
get a list like this:
    ```java
    CarbonReader reader = CarbonReader.builder(dataDir).build();
    List<CarbonReader> multipleReaders = reader.split(3);
    ```
    The `CarbonRecordReader`s in `multipleReaders` will then be distributed as follows:
    `multipleReaders.get(0).readers` covers *carbondata* file indices {0,1,2,3}
    `multipleReaders.get(1).readers` covers *carbondata* file indices {4,5,6}
    `multipleReaders.get(2).readers` covers *carbondata* file indices {7,8,9}
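
    The distribution above is what you get from a contiguous partition in which 
the leftover readers go to the first splits. The following is only a minimal sketch 
of that idea on plain indices, not the code from this pull request; the class name 
`SplitSketch` and its `split(total, parts)` helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: NOT the implementation from the pull request.
// It works on plain integer indices instead of CarbonRecordReader objects.
class SplitSketch {

    // Partition indices 0..total-1 into `parts` contiguous, ordered groups.
    // The first (total % parts) groups each receive one extra index.
    static List<List<Integer>> split(int total, int parts) {
        if (parts <= 0) {
            throw new IllegalArgumentException("parts must be positive");
        }
        List<List<Integer>> groups = new ArrayList<>();
        int base = total / parts;   // minimum size of every group
        int extra = total % parts;  // number of groups that get one more
        int next = 0;               // next index to hand out, never reordered
        for (int p = 0; p < parts; p++) {
            int size = base + (p < extra ? 1 : 0);
            List<Integer> group = new ArrayList<>();
            for (int i = 0; i < size; i++) {
                group.add(next++);
            }
            groups.add(group);
        }
        return groups;
    }

    public static void main(String[] args) {
        // 10 files, 3 splits
        System.out.println(split(10, 3)); // [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
    }
}
```

    Because indices are handed out in increasing order, concatenating the groups 
gives back the original sequence.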
    
    Now, if you drain the readers one after another, as in the following code, 
the rows will still be in order.
    ```java
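    for (CarbonReader reader_i : multipleReaders) {
        // Drain each split completely before moving on to the next one.
        while (reader_i.hasNext()) {
            Object row = reader_i.readNextRow();
        }
    }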
    ```
    
    Earlier, you got data from the 5th `CarbonRecordReader` only after you 
had exhausted the 4th. Now you can get it earlier, maybe even before the 
0th. So if order matters, the user has to make sure to consume it only after 
the 4th file has been used up; if order does not matter, it can be consumed 
earlier. For example, counting the total number of rows does not need the 
original order.
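
    To illustrate the order-insensitive case, here is a hedged sketch of counting 
rows concurrently, one thread per split. It uses plain `Iterator<String>` objects 
as stand-ins for the split `CarbonReader`s so that it runs without the SDK; the 
class `ConcurrentCountSketch` and its `countRows` method are hypothetical names, 
not part of CarbonData.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only: each Iterator stands in for one split reader.
class ConcurrentCountSketch {

    // Count rows across all splits, draining each split on its own thread.
    // The order in which splits finish does not affect the total.
    static long countRows(List<Iterator<String>> splits) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, splits.size()));
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (Iterator<String> split : splits) {
                Callable<Long> task = () -> {
                    long n = 0;
                    while (split.hasNext()) {
                        split.next();
                        n++;
                    }
                    return n;
                };
                results.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get(); // waits for that split to finish
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("row counting failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Iterator<String>> splits = Arrays.asList(
            Arrays.asList("r0", "r1", "r2", "r3").iterator(),
            Arrays.asList("r4", "r5", "r6").iterator(),
            Arrays.asList("r7", "r8", "r9").iterator());
        System.out.println(countRows(splits)); // 10
    }
}
```

    Each split is consumed by exactly one thread, so no synchronization on the 
iterators themselves is needed in this sketch.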
