GitHub user sounakr opened a pull request:
https://github.com/apache/carbondata/pull/2032
[CARBONDATA-2224] External File level reader support
The file level reader reads any CarbonData file placed at an external file
path. Reading can be done through three methods.
a) Reading as a datasource from Spark. CarbonFileLevelFormat.scala is used
in this case to read the file. To create a Spark datasource external table:
" CREATE TABLE sdkOutputTable **USING CarbonDataFileFormat** LOCATION
'$writerOutputFilePath1'"
For more details please refer to the test file
org/apache/carbondata/spark/testsuite/createTable/TestCreateTableUsingCarbonFileLevelFormat.scala.
b) Reading from Spark SQL as an external table. CarbonFileInputFormat.java
is used for reading the files. The create table syntax for this is:
"CREATE EXTERNAL TABLE sdkOutputTable **STORED BY 'carbondatafileformat'**
LOCATION '$writerOutputFilePath6'"
For more details please refer to
org/apache/carbondata/spark/testsuite/createTable/TestCarbonFileInputFormatWithExternalCarbonTable.scala.
c) Reading through a Hadoop MapReduce job. Please refer to
org/apache/carbondata/mapred/TestMapReduceCarbonFileInputFormat.java for more
details.
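As a quick sketch, methods (a) and (b) above amount to the following Spark SQL statements. The paths such as '$writerOutputFilePath1' are placeholders taken from this PR's tests, not real locations, and the final SELECT is only an illustrative usage example:

```sql
-- (a) Spark datasource table over a CarbonData file written externally,
--     read via CarbonFileLevelFormat
CREATE TABLE sdkOutputTable
USING CarbonDataFileFormat
LOCATION '$writerOutputFilePath1';

-- (b) External table read through CarbonFileInputFormat
CREATE EXTERNAL TABLE sdkOutputTable
STORED BY 'carbondatafileformat'
LOCATION '$writerOutputFilePath6';

-- Either table can then be queried normally
SELECT * FROM sdkOutputTable;
```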
- [ ] Any interfaces changed?
- [ ] Any backward compatibility impacted?
- [ ] Document update required?
- [ ] Testing done
Please provide details on
- Whether new unit test cases have been added or why no new tests
are required?
- How it is tested? Please attach test report.
- Is it a performance related change? Please attach the performance
test report.
- Any additional information to help reviewers in testing this
change.
- [ ] For large changes, please consider breaking it into sub-tasks under
an umbrella JIRA.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sounakr/incubator-carbondata file_level_reader
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/carbondata/pull/2032.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2032
----
commit 65ce23b1f6e35c3c6722c7f0c14c19b7c8536d23
Author: Jacky Li <jacky.likun@...>
Date: 2018-01-06T12:28:44Z
[CARBONDATA-1992] Remove partitionId in CarbonTablePath
In CarbonTablePath, there is a deprecated partition id which is always 0;
it should be removed to avoid confusion.
This closes #1765
commit c9ceaaae66574c98a13cc65bc3b91ab8346a456b
Author: Jacky Li <jacky.likun@...>
Date: 2018-01-30T13:24:04Z
[CARBONDATA-2099] Refactor query scan process to improve readability
Unified concepts in the scan process flow:
1. QueryModel contains all parameters for a scan; it is created by an API in
CarbonTable. (In future, CarbonTable will be the entry point for various table
operations.)
2. Use the term ColumnChunk to represent one column in one blocklet, and use
ChunkIndex in the reader to read a specified column chunk.
3. Use the term ColumnPage to represent one page in one ColumnChunk.
4. QueryColumn => ProjectionColumn, indicating it is for projection.
This closes #1874
commit 01fcd539af815956975eb4ea480f14e4bb1a2062
Author: ravipesala <ravi.pesala@...>
Date: 2017-11-15T14:18:40Z
[CARBONDATA-1544][Datamap] Datamap FineGrain implementation
Implemented interfaces for FG datamap and integrated to filterscanner to
use the pruned bitset from FG datamap.
The FG query flow is as follows.
1. The user can add an FG datamap to any table and implement its interfaces.
2. Any filter query which hits a table with a datamap will call the prune
method of the FG datamap.
3. The prune method of FGDatamap returns a list of FineGrainBlocklet; these
blocklets contain block, blocklet, page, and rowid information.
4. The pruned blocklets are internally written to a file, and only the
block, blocklet, and file path information is returned as part of the splits.
5. Based on the splits, the scan RDD schedules the tasks.
6. In the filter scanner, we check the datamap writer path from the split,
read the bitset if it exists, and pass this bitset as input to the scanner.
This closes #1471
commit da82cdbda4f45fa741f56594e23c61a575c2fd2c
Author: Jacky Li <jacky.likun@...>
Date: 2018-02-27T00:51:25Z
[REBASE] resolve conflict after rebasing to master
commit 072c95a6770a2b847e111f3349df271bade62675
Author: Jacky Li <jacky.likun@...>
Date: 2018-02-10T02:34:59Z
Revert "[CARBONDATA-2023][DataLoad] Add size base block allocation in data
loading"
This reverts commit 6dd8b038fc898dbf48ad30adfc870c19eb38e3d0.
commit 50af4d91ca2415d12e559b6070f72bfe5a881641
Author: Jacky Li <jacky.likun@...>
Date: 2018-02-11T13:37:04Z
[CARBONDATA-2159] Remove carbon-spark dependency in store-sdk module
To allow building an assembly JAR of the store-sdk module, it should not
depend on the carbon-spark module.
This closes #1970
commit e77fcac978a87d9d526ea7012954fc8e48e9e34c
Author: xuchuanyin <xuchuanyin@...>
Date: 2018-02-08T06:42:39Z
[CARBONDATA-2023][DataLoad] Add size base block allocation in data loading
CarbonData assigns blocks to nodes at the beginning of data loading.
The previous block allocation strategy was based on block count, and it
suffers from data skew when the sizes of the input files differ a lot.
We introduce a size-based block allocation strategy to optimize data
loading performance in skewed data scenarios.
This closes #1808
commit 00e5208a6da5cc13aabd3ed6c437d2d1c5fa06ff
Author: sounakr <sounakr@...>
Date: 2017-09-28T10:51:05Z
[CARBONDATA-1480]Min Max Index Example for DataMap
A DataMap example: implementation of a Min Max Index through DataMap, and
use of the index while pruning.
This closes #1359
commit 3212c0c025191c754c454ad88de3adbec26dc58b
Author: ravipesala <ravi.pesala@...>
Date: 2017-11-15T14:18:40Z
[CARBONDATA-1544][Datamap] Datamap FineGrain implementation
Implemented interfaces for FG datamap and integrated to filterscanner to
use the pruned bitset from FG datamap.
The FG query flow is as follows.
1. The user can add an FG datamap to any table and implement its interfaces.
2. Any filter query which hits a table with a datamap will call the prune
method of the FG datamap.
3. The prune method of FGDatamap returns a list of FineGrainBlocklet; these
blocklets contain block, blocklet, page, and rowid information.
4. The pruned blocklets are internally written to a file, and only the
block, blocklet, and file path information is returned as part of the splits.
5. Based on the splits, the scan RDD schedules the tasks.
6. In the filter scanner, we check the datamap writer path from the split,
read the bitset if it exists, and pass this bitset as input to the scanner.
This closes #1471
commit aa3f2ff731fa6e0004dea827417c0d932d4a6291
Author: Jacky Li <jacky.likun@...>
Date: 2018-01-06T12:28:44Z
[CARBONDATA-1992] Remove partitionId in CarbonTablePath
In CarbonTablePath, there is a deprecated partition id which is always 0;
it should be removed to avoid confusion.
This closes #1765
commit 3ba31a162dc66bc5ee9023c7ff466c7de4c31c50
Author: Jacky Li <jacky.likun@...>
Date: 2018-01-30T13:24:04Z
[CARBONDATA-2099] Refactor query scan process to improve readability
Unified concepts in the scan process flow:
1. QueryModel contains all parameters for a scan; it is created by an API in
CarbonTable. (In future, CarbonTable will be the entry point for various table
operations.)
2. Use the term ColumnChunk to represent one column in one blocklet, and use
ChunkIndex in the reader to read a specified column chunk.
3. Use the term ColumnPage to represent one page in one ColumnChunk.
4. QueryColumn => ProjectionColumn, indicating it is for projection.
This closes #1874
commit 810f093c28dc9e8a70a04bef1bc701569ec4261e
Author: Jacky Li <jacky.likun@...>
Date: 2018-01-31T08:14:27Z
[CARBONDATA-2025] Unify all path construction through CarbonTablePath
static method
Refactor CarbonTablePath:
1. Remove CarbonStorePath and use CarbonTablePath only.
2. Make CarbonTablePath a utility without object creation; this avoids
creating an object before using it, so the code is cleaner and GC pressure is lower.
This closes #1768
commit 5a91a4cf49e3554f95f88637d93b51c80bf5329f
Author: xuchuanyin <xuchuanyin@...>
Date: 2018-02-08T06:42:39Z
[CARBONDATA-2023][DataLoad] Add size base block allocation in data loading
CarbonData assigns blocks to nodes at the beginning of data loading.
The previous block allocation strategy was based on block count, and it
suffers from data skew when the sizes of the input files differ a lot.
We introduce a size-based block allocation strategy to optimize data
loading performance in skewed data scenarios.
This closes #1808
commit 667303e7dfa515cda7cd3e34c736b74b5e246c29
Author: xuchuanyin <xuchuanyin@...>
Date: 2018-02-08T07:39:45Z
[HotFix][CheckStyle] Fix import related checkstyle
This closes #1952
commit 442350f6cbc908ea02ec6ef5f8d5b748b63d73d9
Author: Jacky Li <jacky.likun@...>
Date: 2018-02-27T03:26:30Z
[REBASE] Solve conflict after merging master
commit ea51dbf0d0d03d5cf9a946594cec61e4d9a2a46d
Author: Jacky Li <jacky.likun@...>
Date: 2018-02-10T02:34:59Z
Revert "[CARBONDATA-2023][DataLoad] Add size base block allocation in data
loading"
This reverts commit 6dd8b038fc898dbf48ad30adfc870c19eb38e3d0.
commit d13f01bfb7bf84fd8a231300219cbc4818eabe5b
Author: sounakr <sounakr@...>
Date: 2018-02-24T02:25:14Z
File Format Reader
commit 06b0c74edbc6097ada28382f27c54905a1b07159
Author: sounakr <sounakr@...>
Date: 2018-02-26T11:58:47Z
File Format Phase 2
commit 372b380470600c03a2f723b53a106a5ce0087ae9
Author: Ajantha-Bhat <ajanthabhat@...>
Date: 2018-02-27T06:06:56Z
* File Format Phase 2 (cleanup code)
commit 8eb20a5dd9543029239a051bd978e855a69d805c
Author: Ajantha-Bhat <ajanthabhat@...>
Date: 2018-02-27T06:36:28Z
* File Format Phase 2 (cleanup code)
commit 462fd28cbc1268bbb529f947ee2e93c068e0d682
Author: Ajantha-Bhat <ajanthabhat@...>
Date: 2018-02-27T09:54:43Z
* File Format Phase 2 (cleanup code and adding testCase)
commit 952688b8cf1b17954b85af6143abcab77d081da8
Author: Ajantha-Bhat <ajanthabhat@...>
Date: 2018-02-27T11:58:37Z
* File Format Phase 2 (filter issue fix)
commit 87c84943122c8523291cc25751829ac143161469
Author: Ajantha-Bhat <ajanthabhat@...>
Date: 2018-02-27T12:20:46Z
* File Format Phase 2 (filter issue fix return value)
commit 3a0c3b9448c3cca0742db0f557518ffa12d0dabb
Author: sounakr <sounakr@...>
Date: 2018-02-27T13:55:16Z
Clear DataMap Cache
commit 1943cf6dcd266cd78483f137e0499083d95e4332
Author: Ajantha-Bhat <ajanthabhat@...>
Date: 2018-02-27T14:02:35Z
* File Format Phase 2 (test cases)
commit 4f97c7e35fade5fe0abb58b0c781a6b7f5b744e9
Author: sounakr <sounakr@...>
Date: 2018-02-28T03:18:45Z
Refactor CarbonFileInputFormat
commit 7df78cf50b658cc6fb79e28b0ad76f74dc8a680a
Author: Ajantha-Bhat <ajanthabhat@...>
Date: 2018-02-28T10:02:08Z
* File Format Phase 2
a. test cases addition
b. Exception handling when the files are not present
c. Setting the filter expression in carbonTableInputFormat
commit 4825fcc8d023c2b1a031ee0417addf5b6f2d5763
Author: Ajantha-Bhat <ajanthabhat@...>
Date: 2018-02-28T10:02:08Z
* File Format Phase 2
a. test cases addition
b. Exception handling when the files are not present
c. Setting the filter expression in carbonTableInputFormat
commit 5e5adbe21b8b786c13fda13e7e052bc5e46f22b4
Author: Ajantha-Bhat <ajanthabhat@...>
Date: 2018-02-28T10:02:08Z
* File Format Phase 2
a. test cases addition
b. Exception handling when the files are not present
c. Setting the filter expression in carbonTableInputFormat
commit b510faa9e033fb2ca0ae64125aee10709201e69f
Author: sounakr <sounakr@...>
Date: 2018-03-01T11:23:39Z
Map Reduce Test Case for CarbonInputFileFormat
----
---