[ https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821005#comment-16821005 ]

xubo245 commented on CARBONDATA-3336:
-------------------------------------

Array<Binary>: org.apache.carbondata.processing.loading.parser.impl.RowParserImpl#parseRow

> Support Binary Data Type
> ------------------------
>
>                 Key: CARBONDATA-3336
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-3336
>             Project: CarbonData
>          Issue Type: New Feature
>            Reporter: xubo245
>            Assignee: xubo245
>            Priority: Major
>         Attachments: CarbonData support binary data type V0.1.pdf
>
>          Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> CarbonData supports binary data type
> Version   Changes                                     Owner   Date
> 0.1       Init doc for supporting binary data type    Xubo    2019-04-10
> Background:
> Binary is a basic data type and is widely used in various scenarios, so it is 
> better to support it in CarbonData. Downloading data from S3 is slow when a 
> dataset contains many small binary files. The majority of application 
> scenarios involve storing small binary data in CarbonData, which avoids the 
> small-files problem and speeds up S3 access, and it can also decrease the 
> cost of accessing OBS by reducing the number of S3 API calls. Storing 
> structured data and unstructured data (binary) together in CarbonData also 
> makes them easier to manage.
> Goals:
> 1. Support writing the binary data type through the Carbon Java SDK.
> 2. Support reading the binary data type through the Spark Carbon file format 
> (carbon datasource) and CarbonSession.
> 3. Support reading the binary data type through the Carbon SDK.
> 4. Support writing binary through Spark.
> Approach and Detail:
>       1. Supporting write of the binary data type by the Carbon Java SDK [Formal]:
>           1.1 The Java SDK needs to support writing data with specific data 
> types, such as int, double, and byte[], rather than converting every data 
> type to a string array. The user reads a binary file as byte[], then the SDK 
> writes the byte[] into the binary column (see the write sketch after this 
> list).
>           1.2 CarbonData compresses the binary column because the compressor 
> is currently table-level.
>               => TODO: support a configuration option for compression; the 
> default should be no compression, because binary data is usually already 
> compressed (for example, JPG images), so there is no need to compress the 
> binary column again. Version 1.5.4 will support column-level compression; 
> after that, we can implement no-compression for binary. We can discuss this 
> with the community.
>           1.3 CarbonData stores binary as a dimension.
>           1.4 Support configuring the page size for the binary data type, 
> because binary data is usually big, such as 200 KB; otherwise one blocklet 
> (32,000 rows) would become very big.
>         TODO: 1.5 Avro and JSON conversion need to be considered.
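> 
> A minimal sketch of the proposed SDK write path. This assumes a new 
> DataTypes.BINARY field type and that CarbonWriter accepts byte[] values for 
> binary columns; both are part of this proposal, not the current API, and the 
> paths are illustrative:
> 
>     import java.nio.file.Files;
>     import java.nio.file.Paths;
>     import org.apache.carbondata.core.metadata.datatype.DataTypes;
>     import org.apache.carbondata.sdk.file.CarbonWriter;
>     import org.apache.carbondata.sdk.file.Field;
>     import org.apache.carbondata.sdk.file.Schema;
> 
>     public class BinaryWriteSketch {
>       public static void main(String[] args) throws Exception {
>         // Schema with a binary column; DataTypes.BINARY is the proposed type.
>         Field[] fields = new Field[]{
>             new Field("name", DataTypes.STRING),
>             new Field("image", DataTypes.BINARY)
>         };
>         CarbonWriter writer = CarbonWriter.builder()
>             .outputPath("/tmp/binary_table")
>             .withCsvInput(new Schema(fields))
>             .writtenBy("BinaryWriteSketch")
>             .build();
>         // Read an image file as byte[] and write it directly into the binary
>         // column, with no byte[] -> String round trip (assumed behavior).
>         byte[] image = Files.readAllBytes(Paths.get("/tmp/sample.jpg"));
>         writer.write(new Object[]{"robot", image});
>         writer.close();
>       }
>     }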
>       
>       2. Supporting read and management of the binary data type by the Spark 
> Carbon file format (carbon DataSource) and CarbonSession. [Formal]
>           2.1 Support reading the binary data type from a non-transactional 
> table: read the binary column and return it as byte[].
>           2.2 Support creating a table with a binary column; the table 
> properties sort_columns, dictionary, COLUMN_META_CACHE, and RANGE_COLUMN are 
> not supported for the binary column.
>       => Evaluate COLUMN_META_CACHE for binary.
>        => The Carbon DataSource doesn't support dictionary-include columns.
>        => carbon.column.compressor applies to all columns.
>           2.3 Support CTAS for binary => transactional/non-transactional.
>           2.4 Support external tables for binary.
>           2.5 Support projection of the binary column.
>           2.6 Support DESC FORMATTED.
>                    => The Carbon DataSource doesn't support ALTER TABLE ADD 
> COLUMN SQL.
>                    => TODO: ALTER TABLE for the binary data type in CarbonSession.
>           2.7 Don't support PARTITION, filter, or BUCKETCOLUMNS for binary.
>           2.8 Support compaction for binary. (TODO)
>           2.9 DataMaps? Don't support the bloomfilter, lucene, or timeseries 
> datamaps; no need for a min/max datamap for binary; support mv and 
> pre-aggregate in the future.
>           2.10 The CSDK / Python SDK will support binary in the future. (TODO)
>           2.11 Support S3.
>           TODO:
>             2.12 Support UDFs such as hex, base64, and cast:
>                    select hex(bin) from carbon_table;
>                    select CAST(s AS BINARY) from carbon_table;
> CarbonSession: impact analysis. (A sketch of the expected Spark-side usage 
> follows.)
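> 
> A hedged sketch of the expected Spark-side usage from Java. The DDL keyword 
> for the carbon DataSource (USING carbon vs. USING carbondata) varies by 
> CarbonData version, binary support itself is what this issue proposes, and 
> the table/column names are illustrative:
> 
>     import org.apache.spark.sql.SparkSession;
> 
>     public class BinarySparkSketch {
>       public static void main(String[] args) {
>         SparkSession spark = SparkSession.builder()
>             .appName("BinarySparkSketch")
>             .master("local[*]")
>             .getOrCreate();
>         // Create a table with a binary column; per 2.2, properties such as
>         // sort_columns must not reference the binary column.
>         spark.sql("CREATE TABLE binary_table (id INT, image BINARY) USING carbon");
>         // 2.5: projection of the binary column returns byte[] per row.
>         spark.sql("SELECT id, image FROM binary_table").show();
>         // 2.12 (proposed): UDFs over the binary column.
>         spark.sql("SELECT hex(image) FROM binary_table").show();
>         spark.stop();
>       }
>     }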
>       
>       3. Supporting read of the binary data type by the Carbon SDK.
>           3.1 Support reading the binary data type from a non-transactional 
> table: read the binary column and return it as byte[] (see the read sketch 
> after this list).
>           3.2 Support projection of the binary column.
>           3.3 Support S3.
>           3.4 No need to support filters.
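> 
> A minimal sketch of the SDK read path under the same assumptions (the binary 
> column comes back as byte[]; the path and column names are illustrative):
> 
>     import org.apache.carbondata.sdk.file.CarbonReader;
> 
>     public class BinaryReadSketch {
>       public static void main(String[] args) throws Exception {
>         // Project only the needed columns; per 3.4, filters on the binary
>         // column are not supported.
>         CarbonReader reader = CarbonReader.builder("/tmp/binary_table", "_temp")
>             .projection(new String[]{"name", "image"})
>             .build();
>         while (reader.hasNext()) {
>           Object[] row = (Object[]) reader.readNextRow();
>           byte[] image = (byte[]) row[1]; // binary column returned as byte[]
>           System.out.println(row[0] + ": " + image.length + " bytes");
>         }
>         reader.close();
>       }
>     }
> 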
>       4. Supporting write of binary by Spark (carbon file format / 
> CarbonSession, POC??).
>           4.1 Convert binary to a String and store it in CSV (one plausible 
> encoding is sketched after this list).
>           4.2 Spark loads the CSV, converts the string back to byte[], and 
> stores it in CarbonData; reading the binary column returns byte[].
>           4.3 Support INSERT INTO (string => binary). TODO: UPDATE and 
> DELETE for binary.
>           4.4 Don't support stream tables.
>       => Refer to Hive and the Spark 2.4 image DataSource.
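> 
> CSV cannot carry raw bytes, so the String conversion in 4.1/4.2 needs a 
> text-safe encoding. Base64 is one illustrative choice, not mandated by this 
> proposal:
> 
>     import java.util.Base64;
> 
>     public class CsvBinaryCodec {
>       // 4.1: encode byte[] to a CSV-safe string before writing the CSV.
>       static String toCsvField(byte[] binary) {
>         return Base64.getEncoder().encodeToString(binary);
>       }
> 
>       // 4.2: decode the string back to byte[] during the Spark load.
>       static byte[] fromCsvField(String field) {
>         return Base64.getDecoder().decode(field);
>       }
>     }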
>  
> mail list: 
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discuss-CarbonData-supports-binary-data-type-td76828.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
