[ 
https://issues.apache.org/jira/browse/CARBONDATA-3351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 resolved CARBONDATA-3351.
---------------------------------
    Resolution: Fixed

> Support Binary Data Type
> ------------------------
>
>                 Key: CARBONDATA-3351
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-3351
>             Project: CarbonData
>          Issue Type: Sub-task
>            Reporter: xubo245
>            Assignee: xubo245
>            Priority: Major
>          Time Spent: 35h
>  Remaining Estimate: 0h
>
> Background:
> Binary is a basic data type and is widely used in many scenarios, so it is 
> better to support the binary data type in CarbonData. Downloading data from 
> S3 is slow when a dataset contains a lot of small binary objects. The 
> majority of application scenarios involve storing small binary values in 
> CarbonData, which avoids the small-files problem, speeds up S3 access, and 
> decreases the cost of accessing OBS by reducing the number of S3 API calls. 
> It also becomes easier to manage structured and unstructured (binary) data 
> by storing them together in CarbonData. 
> Goals:
> 1. Support writing the binary data type through the Carbon Java SDK.
> 2. Support reading the binary data type through the Spark Carbon file format 
> (carbon datasource) and CarbonSession.
> 3. Support reading the binary data type through the Carbon SDK.
> 4. Support writing binary data through Spark.
> Approach and Detail:
>       1. Support writing the binary data type through the Carbon Java SDK [Formal]:
>           1.1 The Java SDK needs to support writing data with specific data 
> types, such as int, double, and byte[], instead of converting every value to 
> a string array. The user reads a binary file as byte[], and the SDK writes 
> the byte[] into the binary column (see the write sketch after this section). => Done
>           1.2 CarbonData compresses the binary column, because the compressor 
> is currently table level. => Done
>           1.3 CarbonData stores binary as a dimension. => Done
>           1.4 Support configuring the page size for the binary data type, 
> because binary values are usually large (for example 200 KB); otherwise a 
> single blocklet (32000 rows) would become very big. => Done
>           1.5 Avro and JSON conversion need to be considered:
>               •       Avro fixed and variable length binary could be 
> supported => Avro doesn't support the binary data type => no need.
>               •       Support reading binary from JSON. => Done
>           1.6 Binary data type as a child column in Struct and Map 
>                    => support it in the future, but the priority is not very 
> high; not in 1.5.4.
>           1.7 Verify the maximum size of a binary value that is supported 
>     => Snappy only supports about 1.71 GB; the maximum data size should be 
> 2 GB, but this needs to be confirmed.
>       
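> For illustration, a minimal Java SDK write sketch. The CarbonWriter builder 
> and DataTypes.BINARY follow the SDK guide, but the exact method names may 
> differ between versions, the output and input paths are placeholders, and it 
> is assumed the CSV-input writer accepts byte[] for binary columns as 
> described in 1.1:
> 
>     import java.nio.file.Files;
>     import java.nio.file.Paths;
>     import org.apache.carbondata.core.metadata.datatype.DataTypes;
>     import org.apache.carbondata.sdk.file.CarbonWriter;
>     import org.apache.carbondata.sdk.file.Field;
>     import org.apache.carbondata.sdk.file.Schema;
> 
>     public class BinaryWriteExample {
>       public static void main(String[] args) throws Exception {
>         // Schema with a string column and a binary column.
>         Field[] fields = new Field[2];
>         fields[0] = new Field("name", DataTypes.STRING);
>         fields[1] = new Field("image", DataTypes.BINARY);
> 
>         CarbonWriter writer = CarbonWriter.builder()
>             .outputPath("./target/binary_demo")      // placeholder path
>             .withCsvInput(new Schema(fields))
>             .writtenBy("BinaryWriteExample")
>             .build();
> 
>         // Read a small binary file as byte[] and write it directly into the
>         // binary column, without converting it to a string first.
>         byte[] image = Files.readAllBytes(Paths.get("./data/image1.png"));
>         writer.write(new Object[]{"image1", image});
>         writer.close();
>       }
>     }
> 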
>       2. Support reading and managing the binary data type through the Spark 
> Carbon file format (carbon DataSource) and CarbonSession. [Formal]
>           2.1 Support reading the binary data type from a non-transactional 
> table; read the binary column and return it as byte[]. => Done
>           2.2 Support creating a table with a binary column (see the SQL 
> sketch after this section); the table properties sort_columns, dictionary, 
> COLUMN_META_CACHE and RANGE_COLUMN are not supported for a binary column. => Done
>        => The Carbon DataSource doesn't support dictionary-include columns.
>        => Support carbon.column.compressor = snappy, zstd, gzip for binary; 
> compression applies to all columns (table level).
>           2.3 Support CTAS for binary => transactional/non-transactional, 
> Carbon/Hive/Parquet. => Done
>           2.4 Support external tables for binary. => Done
>           2.5 Support projection on the binary column. => Done
>           2.6 Support DESC FORMATTED. => Done
>                    => The Carbon DataSource doesn't support the ALTER TABLE 
> add columns SQL.
>                    Support ALTER TABLE (add column, rename, drop column) for 
> the binary data type in CarbonSession. => Done
>                    Changing the data type of a binary column through ALTER 
> TABLE is not supported. => Done
>           2.7 PARTITION and BUCKETCOLUMNS are not supported for binary. => Done
>           2.8 Support compaction for binary. => Done
>           2.9 Datamaps: bloomfilter, lucene and timeseries datamaps are not 
> supported; a min/max datamap is not needed for binary; support MV and 
> pre-aggregate in the future. => TODO
>           2.10 CSDK / Python SDK will support binary in the future. => TODO
>           2.11 Support S3. => Done
>           2.12 Support UDFs: hex, base64, cast. => TODO
>                    select hex(bin) from carbon_table => TODO
>           2.15 Support filter for binary. => Done
>           2.16 select CAST(s AS BINARY) from carbon_table => Done
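> For illustration, a rough sketch of the SQL surface above, run through a 
> SparkSession that is already configured for CarbonData (setup omitted). The 
> table name, column names and the "STORED AS carbondata" create syntax are 
> assumptions and may vary with the deployment:
> 
>     import org.apache.spark.sql.SparkSession;
> 
>     public class BinarySqlExample {
>       public static void main(String[] args) {
>         SparkSession spark = SparkSession.builder()
>             .appName("BinarySqlExample")
>             .getOrCreate();
> 
>         // 2.2: create a table with a binary column; do not list the binary
>         // column in sort_columns, dictionary, COLUMN_META_CACHE or RANGE_COLUMN.
>         spark.sql("CREATE TABLE IF NOT EXISTS binary_table ("
>             + " id INT, name STRING, image BINARY) STORED AS carbondata");
> 
>         // 2.5 / 2.15: projection and filter on the binary column return byte[].
>         spark.sql("SELECT name, image FROM binary_table WHERE image IS NOT NULL").show();
> 
>         // 2.12 / 2.16: UDFs and cast on binary.
>         spark.sql("SELECT hex(image) FROM binary_table").show();
>         spark.sql("SELECT CAST(name AS BINARY) FROM binary_table").show();
>       }
>     }
> 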
>       3. Support reading the binary data type through the Carbon SDK (see 
> the read sketch after this section).
>           3.1 Support reading the binary data type from a non-transactional 
> table; read the binary column and return it as byte[]. => Done
>           3.2 Support projection on the binary column. => Done
>           3.3 Support S3. => Done
>           3.4 No need to support filter. => To be discussed, not in this PR.
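> For illustration, a minimal Java SDK read sketch that pairs with the write 
> sketch above (the CarbonReader builder methods follow the SDK guide; the 
> table path and table name are placeholders):
> 
>     import org.apache.carbondata.sdk.file.CarbonReader;
> 
>     public class BinaryReadExample {
>       public static void main(String[] args) throws Exception {
>         CarbonReader reader = CarbonReader.builder("./target/binary_demo", "_temp")
>             .projection(new String[]{"name", "image"})   // 3.2: projection
>             .build();
> 
>         while (reader.hasNext()) {
>           Object[] row = (Object[]) reader.readNextRow();
>           String name = (String) row[0];
>           byte[] image = (byte[]) row[1];   // 3.1: binary column returned as byte[]
>           System.out.println(name + " -> " + image.length + " bytes");
>         }
>         reader.close();
>       }
>     }
> 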
>       4. Support writing binary through Spark (carbon file format / 
> CarbonSession, POC??); see the load sketch after this section.
>           4.1 Convert binary to String and store it in CSV. => Done
>           4.2 Spark loads the CSV, converts the string to byte[] and stores 
> it in CarbonData; reading the binary column returns byte[]. => Done
>           4.3 Support insert into/update/delete for the binary data type. => Done
>           4.4 Don't support stream tables. => TODO
>           4.5 Verify whether the given value for a binary column is Base64 
> encoded, a plain String, or byte[] for the SDK, file format and CarbonSession.
>            => xubo: I think we should support configurable decoding for 
> binary, e.g. Base64 and Hex; is that OK? Hive also added a TODO to make this 
> configurable; Hive doesn't support Hex or plain strings now. => TODO, not in 
> this PR
>           4.6 Local dictionary can be excluded. => Done
>           4.7 Verify the binary type behavior with data containing null 
> values, and verify the bad records logger behavior with a binary column; it 
> is better to keep the bad records file readable. Should we encode to Base64?
>                       => Support it in the future. Carbon doesn't 
> encode/decode Base64 now; Carbon keeps the output the same as the input. 
> Carbon can support configurable encoding/decoding for binary. CarbonSession 
> only supports loading data from files (CSV), so the bad records file is 
> already readable. How do we identify which record is bad for binary? For the 
> Carbon SDK, we can encode to Base64 by default and add a configuration 
> parameter to convert to other formats, like Hex. Is that OK? => TODO, not in 
> this PR
>           4.8 Verify with both the unsafe true and false configurations for 
> load and query. => Done
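> For illustration, a rough sketch of the Spark write path in this section. 
> Paths, the table name and the CSV layout are placeholders, and the session 
> is assumed to be configured for CarbonData (it reuses binary_table from the 
> SQL sketch above):
> 
>     import org.apache.spark.sql.SparkSession;
> 
>     public class BinaryLoadExample {
>       public static void main(String[] args) {
>         SparkSession spark = SparkSession.builder()
>             .appName("BinaryLoadExample")
>             .getOrCreate();
> 
>         // 4.1 / 4.2: the CSV carries the binary column as a string; the load
>         // converts it to byte[] when writing into the binary column.
>         spark.sql("LOAD DATA INPATH './data/binary_data.csv' INTO TABLE binary_table");
> 
>         // 4.3: insert into also works for the binary data type.
>         spark.sql("INSERT INTO binary_table SELECT id, name, image "
>             + "FROM binary_table WHERE id = 1");
> 
>         spark.sql("SELECT id, name, image FROM binary_table").show();
>       }
>     }
> 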
>       5. The CLI tool supports binary data type columns. => Done
>       
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
