[ https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
xubo245 updated CARBONDATA-3336:
--------------------------------
Description:
CarbonData supports binary data type
Version | Changes                                      | Owner | Date
0.1     | Initial doc for supporting binary data type  | Xubo  | 2019-04-10
Background:
Binary is a basic data type that is widely used in many scenarios, so CarbonData
should support it. Downloading data from S3 is slow when a dataset contains many
small binary objects. Most application scenarios involve storing small binary
values in CarbonData, which avoids the small-files problem, speeds up S3 access,
and cuts the cost of accessing OBS by reducing the number of S3 API calls.
Storing structured and unstructured (binary) data together in CarbonData also
makes them easier to manage.
Goals:
1. Support writing the binary data type through the Carbon Java SDK.
2. Support reading the binary data type through the Spark Carbon file format
(carbon datasource) and CarbonSession.
3. Support reading the binary data type through the Carbon SDK.
4. Support writing binary through Spark.
Approach and Detail:
1. Support writing the binary data type through the Carbon Java SDK [Formal]:
1.1 The Java SDK needs to support writing data with specific data types, such
as int, double, and byte[], instead of converting every value to a string
array. The user reads a binary file as byte[], and the SDK writes the byte[]
into the binary column.
1.2 CarbonData currently compresses binary columns because the compressor is
table level.
=> TODO: support a compression configuration, with no compression as the
default, because binary data (e.g. JPEG images) is usually already compressed,
so there is no need to compress the binary column again. Version 1.5.4 will
support column-level compression; after that we can implement no-compression
for binary. We can discuss this with the community.
1.3 CarbonData stores binary as a dimension.
1.4 Support a configurable page size for the binary data type, because binary
values are usually large (e.g. 200 KB); otherwise a single blocklet (32000
rows) becomes very large.
TODO: 1.5 Consider Avro and JSON conversion for binary. (A writer sketch
follows this list.)
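A minimal, hypothetical sketch of the proposed SDK write path (1.1 and 1.4),
modeled on the existing CarbonWriter builder API. DataTypes.BINARY, the
withPageSizeInMb option, and passing a byte[] row element are assumptions of
this proposal, not existing API:

    import org.apache.carbondata.core.metadata.datatype.DataTypes;
    import org.apache.carbondata.sdk.file.CarbonWriter;
    import org.apache.carbondata.sdk.file.Field;
    import org.apache.carbondata.sdk.file.Schema;

    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class BinaryWriteSketch {
      public static void main(String[] args) throws Exception {
        Field[] fields = new Field[2];
        fields[0] = new Field("name", DataTypes.STRING);
        // DataTypes.BINARY is the proposed type; the name is an assumption.
        fields[1] = new Field("image", DataTypes.BINARY);

        // The user reads the binary file as byte[] (1.1).
        byte[] image = Files.readAllBytes(Paths.get("./image.jpg"));

        CarbonWriter writer = CarbonWriter.builder()
            .outputPath("./carbon_out")
            .withPageSizeInMb(1)  // proposed page-size knob (1.4); name assumed
            .withCsvInput(new Schema(fields))
            .writtenBy("BinaryWriteSketch")
            .build();
        // The SDK writes the byte[] straight into the binary column (1.1),
        // with no string conversion.
        writer.write(new Object[]{"sample", image});
        writer.close();
      }
    }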
2. Support reading and managing the binary data type through the Spark Carbon
file format (carbon DataSource) and CarbonSession. [Formal]
2.1 Support reading the binary data type from a non-transactional table: read
the binary column and return it as byte[].
2.2 Support creating a table with a binary column; the table properties
sort_columns, dictionary, COLUMN_META_CACHE, and RANGE_COLUMN are not
supported for a binary column. (See the DDL and UDF sketch after this list.)
=> Evaluate COLUMN_META_CACHE for binary
=> The Carbon DataSource doesn't support dictionary-include columns
=> carbon.column.compressor applies to all columns
2.3 Support CTAS for binary => transactional/non-transactional
2.4 Support external tables for binary
2.5 Support projection for the binary column
2.6 Support DESC FORMATTED
=> The Carbon DataSource doesn't support ALTER TABLE ADD COLUMN SQL
=> TODO: ALTER TABLE for the binary data type in CarbonSession
2.7 Don't support PARTITION, filter, or BUCKETCOLUMNS for binary
2.8 Support compaction for binary (TODO)
2.9 DataMaps? Don't support the bloomfilter, lucene, or timeseries DataMaps; a
min/max DataMap is unnecessary for binary; support MV and pre-aggregate in the
future
2.10 CSDK / Python SDK will support binary in the future. (TODO)
2.11 Support S3
TODO:
2.12 Support UDFs: hex, base64, cast, e.g.
select hex(bin) from carbon_table;
select CAST(s AS BINARY) from carbon_table;
CarbonSession: impact analysis
2.13 Support configurable decoding for queries: base64 and hex decode.
2.14 Proper error messages for unsupported features such as
delete/update/MV/SI/sort/bloom/streaming.
2.15 How large a binary value can be supported for writing and reading?
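A hedged sketch of the Spark-side usage proposed in 2.2 and 2.12, written
against a plain SparkSession. The BINARY column type in carbon DDL is the
proposal itself; the datasource short name and the table layout are
assumptions; hex, base64, and CAST are standard Spark SQL functions:

    import org.apache.spark.sql.SparkSession;

    public class BinarySparkSketch {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("BinarySparkSketch").master("local[2]").getOrCreate();

        // 2.2: create a table with a binary column. sort_columns, dictionary,
        // COLUMN_META_CACHE, and RANGE_COLUMN would be rejected for "bin".
        spark.sql("CREATE TABLE carbon_table (id INT, s STRING, bin BINARY) "
            + "USING carbon");  // datasource short name assumed

        // 2.12: query the binary column through built-in Spark UDFs.
        spark.sql("SELECT hex(bin) FROM carbon_table").show();
        spark.sql("SELECT base64(bin) FROM carbon_table").show();
        spark.sql("SELECT CAST(s AS BINARY) FROM carbon_table").show();

        spark.stop();
      }
    }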
3. Support reading the binary data type through the Carbon SDK
3.1 Support reading the binary data type from a non-transactional table: read
the binary column and return it as byte[].
3.2 Support projection for the binary column
3.3 Support S3
3.4 No need to support filters. (A reader sketch follows this list.)
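A minimal reader sketch for section 3, following the existing CarbonReader
builder API; the projection call (3.2) already exists, while returning the
binary column as byte[] (3.1) is the proposed behavior:

    import org.apache.carbondata.sdk.file.CarbonReader;

    public class BinaryReadSketch {
      public static void main(String[] args) throws Exception {
        // 3.2: project only the needed columns, including the binary one.
        CarbonReader reader = CarbonReader.builder("./carbon_out", "_temp")
            .projection(new String[]{"name", "image"})
            .build();
        while (reader.hasNext()) {
          Object[] row = (Object[]) reader.readNextRow();
          // 3.1: the proposal is to return the binary column as byte[].
          byte[] image = (byte[]) row[1];
          System.out.println(row[0] + ": " + image.length + " bytes");
        }
        reader.close();
      }
    }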
4. Support writing binary through Spark (carbon file format / CarbonSession,
POC??)
4.1 Convert binary to a string and store it in CSV.
4.2 Spark loads the CSV, converts the string to byte[], and stores it in
CarbonData; read the binary column and return it as byte[]. (A load sketch
follows this list.)
4.3 Support INSERT INTO (string => binary); TODO: update and delete for binary
4.4 Don't support streaming tables.
=> Refer to Hive and the Spark 2.4 image DataSource
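A sketch of the CSV round trip in 4.1/4.2, under the assumption that the
binary payload travels through CSV as base64 text. unbase64 is a standard
Spark function; writing the resulting binary column through the carbon format
is the proposed capability, and the format name and column names are
assumptions:

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.unbase64;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class BinaryLoadSketch {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("BinaryLoadSketch").master("local[2]").getOrCreate();

        // 4.1: the CSV carries the binary value as a base64 string column.
        Dataset<Row> csv = spark.read().option("header", "true")
            .csv("./input.csv");

        // 4.2: convert the string back to byte[] and store it in CarbonData.
        csv.withColumn("image", unbase64(col("image_b64")))
            .drop("image_b64")
            .write().format("carbon")  // format name assumed
            .save("./carbon_out");

        spark.stop();
      }
    }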
5. CLI tool support for binary data type columns
Mailing list:
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discuss-CarbonData-supports-binary-data-type-td76828.html
> Support Binary Data Type
> ------------------------
>
> Key: CARBONDATA-3336
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3336
> Project: CarbonData
> Issue Type: New Feature
> Reporter: xubo245
> Assignee: xubo245
> Priority: Major
> Attachments: CarbonData support binary data type V0.1.pdf
>
> Time Spent: 4h 10m
> Remaining Estimate: 0h
>