Podling Report Reminder - November 2016
Dear podling,

This email was sent by an automated system on behalf of the Apache Incubator PMC. It is an initial reminder to give you plenty of time to prepare your quarterly board report.

The board meeting is scheduled for Wed, 16 November 2016, 10:30 am PDT. The report for your podling will form a part of the Incubator PMC report. The Incubator PMC requires your report to be submitted 2 weeks before the board meeting, to allow sufficient time for review and submission (Wed, November 02).

Please submit your report with sufficient time to allow the Incubator PMC, and subsequently board members, to review and digest. Again, the very latest you should submit your report is 2 weeks prior to the board meeting.

Thanks,

The Apache Incubator PMC

Submitting your Report
--
Your report should contain the following:

* Your project name
* A brief description of your project, which assumes no knowledge of the project or necessarily of its field
* A list of the three most important issues to address in the move towards graduation.
* Any issues that the Incubator PMC or ASF Board might wish/need to be aware of
* How has the community developed since the last report
* How has the project developed since the last report.

This should be appended to the Incubator Wiki page at:

http://wiki.apache.org/incubator/November2016

Note: This is manually populated. You may need to wait a little before this page is created from a template.

Mentors
---
Mentors should review reports for their project(s) and sign them off on the Incubator wiki page. Signing off reports shows that you are following the project - projects that are not signed may raise alarms for the Incubator PMC.

Incubator PMC
[GitHub] incubator-carbondata pull request #262: [CARBONDATA-308] Use CarbonInputForm...
Github user QiangCai commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/262#discussion_r86058166

--- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/CarbonInputSplit.java ---

@@ -22,28 +22,44 @@
 import java.io.DataOutput;
 import java.io.IOException;
 import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.carbondata.core.carbon.datastore.block.BlockletInfos;
+import org.apache.carbondata.core.carbon.datastore.block.Distributable;
+import org.apache.carbondata.core.carbon.datastore.block.TableBlockInfo;
+import org.apache.carbondata.core.carbon.path.CarbonTablePath;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.Writable;
 import org.apache.hadoop.mapreduce.lib.input.FileSplit;
+
 /**
  * Carbon input split to allow distributed read of CarbonInputFormat.
  */
-public class CarbonInputSplit extends FileSplit implements Serializable, Writable {
+public class CarbonInputSplit extends FileSplit implements Distributable, Serializable, Writable {

   private static final long serialVersionUID = 3520344046772190207L;

   private String segmentId;

-  /**
+  public String taskId = "0";
+
+  /*
    * Number of BlockLets in a block
    */
   private int numberOfBlocklets = 0;

-  public CarbonInputSplit() {
-    super(null, 0, 0, new String[0]);
+  public CarbonInputSplit() {
   }

-  public CarbonInputSplit(String segmentId, Path path, long start, long length,
+  private void parserPath(Path path) {
--- End diff --

please use CarbonTablePath.DataFileUtil.getTaskNo

---

If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
[GitHub] incubator-carbondata pull request #262: [CARBONDATA-308] Use CarbonInputForm...
Github user QiangCai commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/262#discussion_r86058188

--- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/CarbonInputSplit.java ---

@@ -22,28 +22,44 @@
 import java.io.DataOutput;
 import java.io.IOException;
 import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.carbondata.core.carbon.datastore.block.BlockletInfos;
+import org.apache.carbondata.core.carbon.datastore.block.Distributable;
+import org.apache.carbondata.core.carbon.datastore.block.TableBlockInfo;
+import org.apache.carbondata.core.carbon.path.CarbonTablePath;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.Writable;
 import org.apache.hadoop.mapreduce.lib.input.FileSplit;
+
 /**
  * Carbon input split to allow distributed read of CarbonInputFormat.
  */
-public class CarbonInputSplit extends FileSplit implements Serializable, Writable {
+public class CarbonInputSplit extends FileSplit implements Distributable, Serializable, Writable {

   private static final long serialVersionUID = 3520344046772190207L;

   private String segmentId;

-  /**
+  public String taskId = "0";
+
+  /*
    * Number of BlockLets in a block
    */
   private int numberOfBlocklets = 0;

-  public CarbonInputSplit() {
-    super(null, 0, 0, new String[0]);
+  public CarbonInputSplit() {
   }

-  public CarbonInputSplit(String segmentId, Path path, long start, long length,
+  private void parserPath(Path path) {
+    String[] nameParts = path.getName().split("-");
+    if (nameParts != null && nameParts.length >= 3) {
+      this.taskId = nameParts[2];
+    }
+  }
+
+  private CarbonInputSplit(String segmentId, Path path, long start, long length,
--- End diff --

please initialize taskId
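The parserPath snippet above derives the task id by splitting the data file name on "-" and taking the third part; the reviewer recommends the project utility CarbonTablePath.DataFileUtil.getTaskNo instead of ad-hoc splitting. A minimal, self-contained sketch of the parsing under review (the file-name layout "part-0-<taskNo>-<timestamp>.carbondata" and the helper name are assumptions for illustration, not CarbonData's actual code):

```java
// Hypothetical sketch of the file-name parsing discussed in the review.
// The "-"-separated layout is an assumption; in the real codebase the
// reviewer suggests CarbonTablePath.DataFileUtil.getTaskNo instead.
public class TaskNoSketch {
    static String getTaskNo(String fileName) {
        String[] nameParts = fileName.split("-");
        // Guard against malformed names instead of assuming three parts exist,
        // and fall back to the "0" default the diff gives taskId.
        return nameParts.length >= 3 ? nameParts[2] : "0";
    }

    public static void main(String[] args) {
        System.out.println(getTaskNo("part-0-3-1478000000000.carbondata"));
    }
}
```

Centralizing this in one utility, as the reviewer asks, keeps the file-name convention in a single place instead of re-implementing the split in every caller.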
[GitHub] incubator-carbondata pull request #272: [CARBONDATA-353]Update doc for datef...
Github user lion-x commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/272#discussion_r86058866

--- Diff: docs/DML-Operations-on-Carbon.md ---

@@ -91,12 +91,17 @@ Following are the options that can be used in load data:
 ```ruby
 OPTIONS('ALL_DICTIONARY_PATH'='/opt/alldictionary/data.dictionary')
 ```
-- **COLUMNDICT:** dictionary file path for single column.
+- **COLUMNDICT:** Dictionary file path for each column.
 ```ruby
 OPTIONS('COLUMNDICT'='column1:dictionaryFilePath1, column2:dictionaryFilePath2')
 ```
 Note: ALL_DICTIONARY_PATH and COLUMNDICT can't be used together.
+- **DATEFORMAT:** Date format for each column.
+
+```ruby
+OPTIONS('DATEFORMAT'='column1:dateFormat1, column2:dateFormat2')
--- End diff --

I added a note referring to the Java SimpleDateFormat class doc. It provides more details.
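The DATEFORMAT option in the diff maps column names to date patterns, and the comment points to Java's SimpleDateFormat for the pattern syntax. A rough sketch of how such an option string could be split into per-column formats (the parsing helper is hypothetical; only the "column:format" option syntax comes from the doc example):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: turn a DATEFORMAT option value such as
// "column1:yyyy-MM-dd, column2:dd/MM/yyyy" into per-column
// SimpleDateFormat instances. Not CarbonData's actual loader code.
public class DateFormatOption {
    static Map<String, SimpleDateFormat> parse(String option) {
        Map<String, SimpleDateFormat> formats = new HashMap<>();
        for (String entry : option.split(",")) {
            // Limit 2 so patterns containing ':' (e.g. HH:mm:ss) stay intact.
            String[] kv = entry.trim().split(":", 2);
            formats.put(kv[0], new SimpleDateFormat(kv[1]));
        }
        return formats;
    }

    public static void main(String[] args) throws ParseException {
        Map<String, SimpleDateFormat> f = parse("column1:yyyy-MM-dd, column2:dd/MM/yyyy");
        System.out.println(f.get("column1").parse("2016-11-02"));
    }
}
```

The pattern letters themselves (yyyy, MM, dd, HH, and so on) are exactly the SimpleDateFormat vocabulary the added doc note refers to.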
[GitHub] incubator-carbondata pull request #272: [CARBONDATA-353]Update doc for datef...
Github user lion-x commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/272#discussion_r86058765

--- Diff: docs/DML-Operations-on-Carbon.md ---

@@ -91,12 +91,17 @@ Following are the options that can be used in load data:
 ```ruby
 OPTIONS('ALL_DICTIONARY_PATH'='/opt/alldictionary/data.dictionary')
 ```
-- **COLUMNDICT:** dictionary file path for single column.
+- **COLUMNDICT:** Dictionary file path for each column.
 ```ruby
 OPTIONS('COLUMNDICT'='column1:dictionaryFilePath1, column2:dictionaryFilePath2')
 ```
 Note: ALL_DICTIONARY_PATH and COLUMNDICT can't be used together.
+- **DATEFORMAT:** Date format for each column.
--- End diff --

updated
[GitHub] incubator-carbondata pull request #272: [CARBONDATA-353]Update doc for datef...
Github user lion-x commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/272#discussion_r86058742

--- Diff: docs/DML-Operations-on-Carbon.md ---

@@ -91,12 +91,17 @@ Following are the options that can be used in load data:
 ```ruby
 OPTIONS('ALL_DICTIONARY_PATH'='/opt/alldictionary/data.dictionary')
 ```
-- **COLUMNDICT:** dictionary file path for single column.
+- **COLUMNDICT:** Dictionary file path for each column.
--- End diff --

updated
[GitHub] incubator-carbondata pull request #274: [CARBONDATA-355] Remove unnecessary ...
GitHub user Hexiaoqiao opened a pull request:

https://github.com/apache/incubator-carbondata/pull/274

[CARBONDATA-355] Remove unnecessary method argument columnIdentifier of PathService.getCarbonTablePath

It is not necessary to pass the argument #columnIdentifier when getting the table path through PathService.getCarbonTablePath.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Hexiaoqiao/incubator-carbondata carbon-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-carbondata/pull/274.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #274
[jira] [Created] (CARBONDATA-355) Remove unnecessary method argument columnIdentifier of PathService.getCarbonTablePath
He Xiaoqiao created CARBONDATA-355:
--

Summary: Remove unnecessary method argument columnIdentifier of PathService.getCarbonTablePath
Key: CARBONDATA-355
URL: https://issues.apache.org/jira/browse/CARBONDATA-355
Project: CarbonData
Issue Type: Improvement
Components: core
Affects Versions: 0.2.0-incubating
Reporter: He Xiaoqiao
Assignee: He Xiaoqiao
Priority: Minor

Remove one of the method arguments of PathService#getCarbonTablePath, since it is not necessary to pass columnIdentifier when getting the table path.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [Discussion] Please vote and comment for carbon data file format change
Hi Xiaoqiao He,

Please find the attachment.

-Regards
Kumar Vishal

On Tue, Nov 1, 2016 at 9:27 PM, Xiaoqiao He wrote:

> Hi Kumar Vishal,
>
> I couldn't get Fig. of the file format, could you re-upload them?
> Thanks.
>
> Best Regards
>
> On Tue, Nov 1, 2016 at 7:12 PM, Kumar Vishal wrote:
>
> > Hello All,
> >
> > Improving carbon first time query performance
> >
> > Reason:
> > 1. As file system cache is cleared file reading will make it slower to
> > read and cache
> > 2. In first time query carbon will have to read the footer from file data
> > file to form the btree
> > 3. Carbon reading more footer data than its required(data chunk)
> > 4. There are lots of random seek is happening in carbon as column
> > data(data page, rle, inverted index) are not stored together.
> >
> > Solution:
> > 1. Improve block loading time. This can be done by removing data chunk
> > from blockletInfo and storing only offset and length of data chunk
> > 2. compress presence meta bitset stored for null values for measure column
> > using snappy
> > 3. Store the metadata and data of a column together and read together this
> > reduces random seek and improve IO
> >
> > For this I am planing to change the carbondata thrift format
> >
> > *Old format*
> >
> > *New format*
> >
> > Please vote and comment for this new format change
> >
> > -Regards
> > Kumar Vishal
[GitHub] incubator-carbondata pull request #272: [CARBONDATA-353]Update doc for datef...
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/272#discussion_r85958878

--- Diff: docs/DML-Operations-on-Carbon.md ---

@@ -91,12 +91,17 @@ Following are the options that can be used in load data:
 ```ruby
 OPTIONS('ALL_DICTIONARY_PATH'='/opt/alldictionary/data.dictionary')
 ```
-- **COLUMNDICT:** dictionary file path for single column.
+- **COLUMNDICT:** Dictionary file path for each column.
--- End diff --

change `each` to `specified`
Re: [Discussion] Please vote and comment for carbon data file format change
Hi Kumar Vishal,

I couldn't get Fig. of the file format, could you re-upload them? Thanks.

Best Regards

On Tue, Nov 1, 2016 at 7:12 PM, Kumar Vishal wrote:

> Hello All,
>
> Improving carbon first time query performance
>
> Reason:
> 1. As file system cache is cleared file reading will make it slower to
> read and cache
> 2. In first time query carbon will have to read the footer from file data
> file to form the btree
> 3. Carbon reading more footer data than its required(data chunk)
> 4. There are lots of random seek is happening in carbon as column
> data(data page, rle, inverted index) are not stored together.
>
> Solution:
> 1. Improve block loading time. This can be done by removing data chunk
> from blockletInfo and storing only offset and length of data chunk
> 2. compress presence meta bitset stored for null values for measure column
> using snappy
> 3. Store the metadata and data of a column together and read together this
> reduces random seek and improve IO
>
> For this I am planing to change the carbondata thrift format
>
> *Old format*
>
> *New format*
>
> Please vote and comment for this new format change
>
> -Regards
> Kumar Vishal
[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim property fo...
Github user sujith71955 commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/200#discussion_r85957010

--- Diff: processing/src/main/java/org/apache/carbondata/processing/graphgenerator/GraphGenerator.java ---

@@ -998,4 +1000,24 @@ private void prepareIsUseInvertedIndex(List<CarbonDimension> dims,
     graphConfig.setIsUseInvertedIndex(
         isUseInvertedIndexList.toArray(new Boolean[isUseInvertedIndexList.size()]));
   }
+
+  /**
+   * Preparing the boolean [] to map whether the dimension use trim or not.
+   *
+   * @param dims
+   * @param graphConfig
+   */
+  private void prepareIsUseTrim(List<CarbonDimension> dims,
+      GraphConfigurationInfo graphConfig) {
+    List<Boolean> isUseTrimList = new ArrayList<>();
+    for (CarbonDimension dimension : dims) {
+      if (dimension.isUseTrim()) {
--- End diff --

Can we add this trim option as a property of the column, i.e. inside the column properties, rather than setting it directly as a CarbonDimension property? CarbonColumn already has a property map which holds column-related properties; I think we can use that. Please check the feasibility.
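The reviewer's suggestion is to carry the trim flag in the column's generic property map (as CarbonColumn already does for other column options) instead of adding a dedicated field to CarbonDimension. A small sketch of that pattern, with class and key names that are illustrative assumptions rather than the actual CarbonData API:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the reviewer's proposal: keep per-column options
// such as "trim" in a column-properties map rather than adding a new boolean
// field per option. The class name and "trim" key are assumptions.
public class ColumnPropertiesSketch {
    private final Map<String, String> columnProperties = new HashMap<>();

    void setProperty(String key, String value) {
        columnProperties.put(key, value);
    }

    boolean isUseTrim() {
        // An absent key defaults to false, so columns without the
        // option behave exactly as before.
        return Boolean.parseBoolean(columnProperties.getOrDefault("trim", "false"));
    }
}
```

The appeal of the map is that each new column option becomes a key/value pair instead of a schema change, which is why the reviewer asks to check feasibility before adding another dedicated field.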
[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim property fo...
Github user sujith71955 commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/200#discussion_r85957803

--- Diff: processing/src/main/java/org/apache/carbondata/processing/surrogatekeysgenerator/csvbased/CarbonCSVBasedSeqGenStep.java ---

@@ -472,6 +475,7 @@ public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws K
       break;
     }
   }
+<<< HEAD
--- End diff --

is this file having any conflict?
[GitHub] incubator-carbondata pull request #200: [CARBONDATA-276]add trim property fo...
Github user sujith71955 commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/200#discussion_r85957411

--- Diff: processing/src/main/java/org/apache/carbondata/processing/surrogatekeysgenerator/csvbased/CarbonCSVBasedSeqGenMeta.java ---

@@ -1694,5 +1699,19 @@ public void setTableOption(String tableOption) {
   public TableOptionWrapper getTableOptionWrapper() {
     return tableOptionWrapper;
   }
+
+  public String getIsUseTrim() {
+    return isUseTrim;
+  }
+
+  public void setIsUseTrim(Boolean[] isUseTrim) {
+    for (Boolean flag : isUseTrim) {
+      if (flag) {
+        this.isUseTrim += "T";
--- End diff --

Use TRUE/FALSE for better readability
[GitHub] incubator-carbondata pull request #263: [CARBONDATA-2][WIP] Data load integr...
Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/263#discussion_r85923089

--- Diff: integration/spark/src/main/java/org/apache/carbondata/spark/load/CarbonLoaderUtil.java ---

@@ -213,6 +224,64 @@
     info, loadModel.getPartitionId(), loadModel.getCarbonDataLoadSchema());
   }

+  public static void executeNewDataLoad(CarbonLoadModel loadModel, String storeLocation,
+      String hdfsStoreLocation, RecordReader[] recordReaders)
+      throws Exception {
+    if (!new File(storeLocation).mkdirs()) {
+      LOGGER.error("Error while creating the temp store path: " + storeLocation);
+    }
+    CarbonDataLoadConfiguration configuration = new CarbonDataLoadConfiguration();
+    String databaseName = loadModel.getDatabaseName();
+    String tableName = loadModel.getTableName();
+    String tempLocationKey = databaseName + CarbonCommonConstants.UNDERSCORE + tableName
+        + CarbonCommonConstants.UNDERSCORE + loadModel.getTaskNo();
+    CarbonProperties.getInstance().addProperty(tempLocationKey, storeLocation);
+    CarbonProperties.getInstance()
+        .addProperty(CarbonCommonConstants.STORE_LOCATION_HDFS, hdfsStoreLocation);
+    // CarbonProperties.getInstance().addProperty("store_output_location", outPutLoc);
+    CarbonProperties.getInstance().addProperty("send.signal.load", "false");
+
+    CarbonTable carbonTable = loadModel.getCarbonDataLoadSchema().getCarbonTable();
+    AbsoluteTableIdentifier identifier =
+        carbonTable.getAbsoluteTableIdentifier();
+    configuration.setTableIdentifier(identifier);
+    String csvHeader = loadModel.getCsvHeader();
+    if (csvHeader != null && !csvHeader.isEmpty()) {
+      configuration.setHeader(CarbonDataProcessorUtil.getColumnFields(csvHeader, ","));
+    } else {
+      CarbonFile csvFile =
+          CarbonDataProcessorUtil.getCsvFileToRead(loadModel.getFactFilesToProcess().get(0));
+      configuration
+          .setHeader(CarbonDataProcessorUtil.getFileHeader(csvFile, loadModel.getCsvDelimiter()));
+    }
+
+    configuration.setPartitionId(loadModel.getPartitionId());
+    configuration.setSegmentId(loadModel.getSegmentId());
+    configuration.setTaskNo(loadModel.getTaskNo());
+    configuration.setDataLoadProperty(DataLoadProcessorConstants.COMPLEX_DELIMITERS,
+        new String[] { loadModel.getComplexDelimiterLevel1(),
+            loadModel.getComplexDelimiterLevel2() });
+    List<CarbonDimension> dimensions =
+        carbonTable.getDimensionByTableName(carbonTable.getFactTableName());
+    List<CarbonMeasure> measures =
+        carbonTable.getMeasureByTableName(carbonTable.getFactTableName());
+    DataField[] dataFields = new DataField[dimensions.size() + measures.size()];
+
+    int i = 0;
+    for (CarbonColumn column : dimensions) {
+      dataFields[i++] = new DataField(column);
+    }
+    for (CarbonColumn column : measures) {
+      dataFields[i++] = new DataField(column);
+    }
+    Iterator[] iterators = new RecordReaderIterator[recordReaders.length];
+    configuration.setDataFields(dataFields);
+    for (int j = 0; j < recordReaders.length; j++) {
+      iterators[j] = new RecordReaderIterator(recordReaders[j]);
+    }
+    new DataLoadProcessExecutor().execute(configuration, iterators);
--- End diff --

Yes, we can use it, but here it is a slightly customized API that takes a list of iterators/record readers to process in parallel.
[Discussion] Please vote and comment for carbon data file format change
Hello All,

Improving carbon first time query performance

Reason:
1. As the file system cache is cleared, file reading will be slower to read and cache
2. For the first query, carbon has to read the footer from the data file to form the btree
3. Carbon reads more footer data than it requires (data chunk)
4. There are lots of random seeks happening in carbon, as column data (data page, rle, inverted index) are not stored together.

Solution:
1. Improve block loading time. This can be done by removing the data chunk from blockletInfo and storing only the offset and length of the data chunk
2. Compress the presence meta bitset stored for null values of a measure column using snappy
3. Store the metadata and data of a column together and read them together; this reduces random seeks and improves IO

For this I am planning to change the carbondata thrift format

*Old format*

*New format*

Please vote and comment for this new format change

-Regards
Kumar Vishal
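Solution point 2 works because a presence bitset for null values is mostly zeros, so it compresses extremely well. A runnable sketch of the idea is below; the proposal names snappy, but to keep this example dependency-free it uses the JDK's Deflater as a stand-in (snappy-java would be the drop-in choice in the real code):

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch of solution point 2: compress the null-presence bitset of a measure
// column. Deflater stands in for snappy only so the example runs with the
// stdlib alone; the proposal itself uses snappy.
public class PresenceMetaCompression {
    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        byte[] buf = new byte[raw.length + 64]; // worst-case head-room
        int n = deflater.deflate(buf);
        deflater.end();
        return Arrays.copyOf(buf, n);
    }

    static byte[] decompress(byte[] packed, int rawLength) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        byte[] out = new byte[rawLength];
        inflater.inflate(out);
        inflater.end();
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Sparse null bitmap: one null row every 10,000 rows of a million.
        BitSet presenceMeta = new BitSet(1_000_000);
        for (int i = 0; i < 1_000_000; i += 10_000) {
            presenceMeta.set(i);
        }
        byte[] raw = presenceMeta.toByteArray();
        byte[] packed = compress(raw);
        System.out.println(raw.length + " bytes -> " + packed.length + " bytes");
    }
}
```

Since most measure columns have few nulls, the compressed bitset is a tiny fraction of the raw one, which directly shrinks the footer read on the first query.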
[jira] [Created] (CARBONDATA-354) Query execute successfully even not argument given in count function
Prabhat Kashyap created CARBONDATA-354:
--

Summary: Query execute successfully even not argument given in count function
Key: CARBONDATA-354
URL: https://issues.apache.org/jira/browse/CARBONDATA-354
Project: CarbonData
Issue Type: Bug
Reporter: Prabhat Kashyap
Priority: Minor

When I execute the following command: select count() from tableName; it gives no error and executes successfully, but the same statement throws the following exception in Hive: FAILED: UDFArgumentException Argument expected
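The gap reported here is an argument-count check: Hive rejects count() at analysis time with UDFArgumentException, while the query runs in CarbonData. A toy sketch of the kind of validation involved (the exception and method names are illustrative assumptions, not the actual Hive or CarbonData code):

```java
// Illustrative argument-count check in the spirit of Hive's
// "FAILED: UDFArgumentException Argument expected". Names are hypothetical.
public class CountArgCheck {
    static class UdfArgumentException extends RuntimeException {
        UdfArgumentException(String msg) { super(msg); }
    }

    // count(*) and count(col, ...) pass; a bare count() is rejected.
    static void validateCountArgs(String[] args) {
        if (args.length == 0) {
            throw new UdfArgumentException("Argument expected");
        }
    }

    public static void main(String[] args) {
        validateCountArgs(new String[] {"*"}); // ok: count(*)
        try {
            validateCountArgs(new String[] {}); // count() should fail
        } catch (UdfArgumentException e) {
            System.out.println("FAILED: " + e.getMessage());
        }
    }
}
```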
Re: [ANN] Kumar Vishal as new CarbonData committer
Dear Jean,

Thanks for this opportunity.

-Regards
Kumar Vishal

On Nov 1, 2016 11:53, "Jean-Baptiste Onofré" wrote:

> Hi all,
>
> I'm pleased to announce that the PPMC has invited Kumar Vishal as new
> CarbonData committer, and the invite has been accepted !
>
> Congrats to Kumar and welcome aboard.
>
> Regards
> JB
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
Multiple arguments in count function.
I am able to use multiple column names, and even an asterisk together with columns, in the count function. Even though it gives me the correct output, I think it should not be allowed, as the same queries give an error when executed in HIVE. Please confirm.

--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Multiple-arguments-in-count-function-tp2487.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.
Re: [ANN] Kumar Vishal as new CarbonData committer
Hi Kumar Vishal,

Congrats to you and welcome aboard!

Regards
Liang

2016-11-01 14:23 GMT+08:00 Jean-Baptiste Onofré:

> Hi all,
>
> I'm pleased to announce that the PPMC has invited Kumar Vishal as new
> CarbonData committer, and the invite has been accepted !
>
> Congrats to Kumar and welcome aboard.
>
> Regards
> JB
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com

--
Regards
Liang
Re: No error on executing 'IN' operator without specifying any columnname
Whenever I try to run the following command in HIVE: it executes normally in HIVE, but the same query gives the following error in CARBONDATA : Please confirm whether it is a CarbonData bug or a Spark one.

--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/No-error-on-executing-IN-operator-without-specifying-any-column-name-tp2424p2485.html
[ANN] Kumar Vishal as new CarbonData committer
Hi all,

I'm pleased to announce that the PPMC has invited Kumar Vishal as new CarbonData committer, and the invite has been accepted !

Congrats to Kumar and welcome aboard.

Regards
JB
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com