[ https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821005#comment-16821005 ]
xubo245 commented on CARBONDATA-3336:
-------------------------------------
Array<Binary>: org.apache.carbondata.processing.loading.parser.impl.RowParserImpl#parseRow

> Support Binary Data Type
> ------------------------
>
>                 Key: CARBONDATA-3336
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-3336
>             Project: CarbonData
>          Issue Type: New Feature
>            Reporter: xubo245
>            Assignee: xubo245
>            Priority: Major
>         Attachments: CarbonData support binary data type V0.1.pdf
>          Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> CarbonData supports binary data type
>
> Version  Changes                                   Owner  Date
> 0.1      Init doc for supporting binary data type  Xubo   2019-4-10
>
> Background:
> Binary is a basic data type and is widely used in various scenarios, so it is better to support it in CarbonData. Downloading data from S3 can be slow when a dataset contains many small binary objects. The main application scenarios involve storing small binary values in CarbonData, which avoids the small-files problem, speeds up S3 access, and lowers the cost of accessing OBS by reducing the number of S3 API calls. Storing structured data and unstructured (binary) data together in CarbonData also makes them easier to manage.
>
> Goals:
> 1. Support writing the binary data type through the Carbon Java SDK.
> 2. Support reading the binary data type through the Spark Carbon file format (carbon datasource) and CarbonSession.
> 3. Support reading the binary data type through the Carbon SDK.
> 4. Support writing binary through Spark.
>
> Approach and Detail:
> 1. Supporting write binary data type by Carbon Java SDK [Formal]:
>     1.1 The Java SDK needs to support writing data with specific data types, such as int, double, and byte[], with no need to convert every value to a string array. The user reads a binary file as byte[], then the SDK writes the byte[] into the binary column.
>     1.2 CarbonData compresses the binary column because the compressor is currently table level.
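The no-compression default proposed in 1.2 can be motivated with a small, self-contained experiment: already-compressed payloads (such as JPG images) are close to random at the byte level and barely shrink under deflate, while repetitive data shrinks a lot. This is a plain-JDK sketch, not CarbonData code; the class and method names are illustrative only.

```java
import java.util.Random;
import java.util.zip.Deflater;

// Illustrative only: shows why compressing already-compressed binary data
// (e.g., JPG images) gains little, which motivates defaulting the binary
// column to no compression. Not part of the CarbonData SDK.
public class CompressBinaryDemo {
    // Deflate the input and return the compressed size in bytes.
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] out = new byte[input.length * 2 + 64];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(out);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // High-entropy bytes stand in for already-compressed image data.
        byte[] randomLike = new byte[200_000];
        new Random(42).nextBytes(randomLike);

        // Repetitive bytes stand in for compressible text-like data.
        byte[] textLike = new byte[200_000];
        for (int i = 0; i < textLike.length; i++) {
            textLike[i] = (byte) ('a' + i % 4);
        }

        System.out.println("random-like: " + deflatedSize(randomLike) + " bytes");
        System.out.println("text-like:   " + deflatedSize(textLike) + " bytes");
    }
}
```

The random-like input stays near its original 200,000 bytes (deflate even adds a small framing overhead), while the repetitive input collapses to a tiny fraction of its size.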
>         => TODO: support a configuration option for compression, defaulting to no compression, because binary data is usually already compressed (e.g., JPG images), so there is no need to compress the binary column again. Version 1.5.4 will support column-level compression; after that we can implement no-compression for binary. We can discuss this with the community.
>     1.3 CarbonData stores binary as a dimension.
>     1.4 Support configuring the page size for the binary data type, because binary values are usually large (e.g., 200 KB); otherwise a single blocklet (32000 rows) would become very big.
>     TODO: 1.5 Avro and JSON conversion need to be considered.
>
> 2. Supporting read and manage binary data type by Spark Carbon file format (carbon DataSource) and CarbonSession [Formal]:
>     2.1 Support reading the binary data type from a non-transactional table: read the binary column and return it as byte[].
>     2.2 Support creating a table with a binary column; the table properties sort_columns, dictionary, COLUMN_META_CACHE, and RANGE_COLUMN are not supported for a binary column.
>         => Evaluate COLUMN_META_CACHE for binary.
>         => The Carbon DataSource does not support dictionary-include columns.
>         => carbon.column.compressor applies to all columns.
>     2.3 Support CTAS for binary => transactional/non-transactional.
>     2.4 Support external tables for binary.
>     2.5 Support projection of a binary column.
>     2.6 Support DESC FORMATTED.
>         => The Carbon DataSource does not support ALTER TABLE ADD COLUMN SQL.
>         => TODO: ALTER TABLE for the binary data type in CarbonSession.
>     2.7 Do not support PARTITION, filter, or BUCKETCOLUMNS for binary.
>     2.8 Support compaction for binary (TODO).
>     2.9 DataMaps? Do not support bloomfilter, lucene, or timeseries datamaps; no min/max datamap is needed for binary; support mv and pre-aggregate in the future.
>     2.10 The CSDK / Python SDK will support binary in the future (TODO).
>     2.11 Support S3.
>     TODO:
>     2.12 Support the UDFs hex, base64, and cast:
>         select hex(bin) from carbon_table;
>         select CAST(s AS BINARY) from carbon_table;
>     CarbonSession: impact analysis.
>
> 3.
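The byte-level semantics behind the UDFs proposed in 2.12 can be sketched in plain Java: hex(bin) renders each byte as two hex digits, base64(bin) is standard Base64 over the raw bytes, and in Spark SQL CAST(s AS BINARY) yields the string's UTF-8 bytes. The class and method names below are illustrative only; the real UDFs would come from Spark SQL, not from CarbonData code like this.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Plain-Java sketch of the byte-level semantics behind the proposed UDFs:
// hex(bin), base64(bin), and CAST(s AS BINARY). Illustrative only.
public class BinaryUdfSemantics {
    // hex(bin): render each byte as two uppercase hex digits.
    static String hex(byte[] bin) {
        StringBuilder sb = new StringBuilder(bin.length * 2);
        for (byte b : bin) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // base64(bin): standard Base64 encoding of the raw bytes.
    static String base64(byte[] bin) {
        return Base64.getEncoder().encodeToString(bin);
    }

    // CAST(s AS BINARY): in Spark SQL this yields the string's UTF-8 bytes.
    static byte[] castToBinary(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bin = castToBinary("carbon");
        System.out.println(hex(bin));     // 636172626F6E
        System.out.println(base64(bin));  // Y2FyYm9u
    }
}
```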
Supporting read binary data type by Carbon SDK:
>     3.1 Support reading the binary data type from a non-transactional table: read the binary column and return it as byte[].
>     3.2 Support projection of a binary column.
>     3.3 Support S3.
>     3.4 No need to support filter.
>
> 4. Supporting write binary by Spark (carbon file format / CarbonSession, POC??):
>     4.1 Convert binary to a string and store it in CSV.
>     4.2 Spark loads the CSV, converts the string back to byte[], and stores it in CarbonData; reading the binary column returns byte[].
>     4.3 Support insert into (string => binary). TODO: update and delete for binary.
>     4.4 Do not support streaming tables.
>         => Refer to Hive and the Spark 2.4 image DataSource.
>
> Mailing list:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discuss-CarbonData-supports-binary-data-type-td76828.html

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
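Steps 4.1 and 4.2 require a CSV-safe string encoding of the binary payload that round-trips exactly. The document does not name an encoding, so Base64 is assumed here as one plausible choice; this plain-JDK sketch only demonstrates the lossless round trip, the real load path would go through Spark.

```java
import java.util.Arrays;
import java.util.Base64;

// Sketch of steps 4.1/4.2: encode binary as a CSV-safe string (Base64 is
// an assumed encoding, not specified by the design doc), then decode it
// back to the original byte[] for the binary column. Illustrative only.
public class CsvBinaryRoundTrip {
    // 4.1: binary -> CSV-safe string cell.
    static String toCsvCell(byte[] binary) {
        return Base64.getEncoder().encodeToString(binary);
    }

    // 4.2: CSV string cell -> byte[] to store in the binary column.
    static byte[] fromCsvCell(String cell) {
        return Base64.getDecoder().decode(cell);
    }

    public static void main(String[] args) {
        // Stand-in for raw image bytes, including values unsafe in raw CSV.
        byte[] original = {0x00, 0x10, (byte) 0xFF, 0x42};
        String cell = toCsvCell(original);
        byte[] restored = fromCsvCell(cell);
        System.out.println("round trip ok: " + Arrays.equals(original, restored));
    }
}
```

A text encoding like Base64 avoids embedding delimiters, quotes, or NUL bytes in the CSV cell, at the cost of roughly a 4/3 size expansion.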