[
https://issues.apache.org/jira/browse/OOZIE-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715865#comment-13715865
]
Virag Kothari commented on OOZIE-1462:
--------------------------------------
I took the largest value from the missing_deps column and compressed it using
gzip at the default compression level.
The results are below:
Original size of data: 1590438 bytes
Default compression: 30918 bytes
Compression reduces the size by a factor of about 51.
I also compressed another column, 'wf_instance', which primarily stores the
wf definition:
Original size of wf_instance: 230379 bytes
Default compression: 18233 bytes
Compression on this column reduces the size by a factor of about 12.6.
As these two blobs are usually the largest, I ran the experiments on them.
I expect compression of the other columns to yield similar results, but I
have not measured them.
The data will be in binary form after compression, so we need to change all
character lobs (String) to binary lobs (byte[]), which will require a schema
change.
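As a rough illustration of the String to byte[] change, here is a minimal
sketch using the gzip classes from java.util.zip at their default settings.
The class and method names (LobCompression, compress, decompress) are
hypothetical and are not taken from the Oozie code base, and UTF-8 is an
assumed encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Oozie: converts a character lob value
// (String) to a gzip-compressed binary lob value (byte[]) and back.
class LobCompression {

    // Compress the text with gzip at its default compression level.
    static byte[] compress(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        }
        // The stream is closed here, so the gzip trailer has been written.
        return buffer.toByteArray();
    }

    // Reverse of compress(): inflate the bytes and decode them as UTF-8.
    static String decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream gzip =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Made-up, repetitive sample data standing in for a wf definition.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("<action name=\"a\"/>");
        }
        String original = sb.toString();

        byte[] packed = compress(original);
        System.out.println("original bytes:   "
            + original.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("compressed bytes: " + packed.length);
        System.out.println("round trip ok:    "
            + decompress(packed).equals(original));
    }
}
```

The sample data here is far more repetitive than real column contents, so
the ratio it prints says nothing about the measurements above.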
Also, if we compress, we can use the inline storage benefits provided for lobs.
The inline storage limit in Oracle is 4000 bytes, while in MySQL it is 768
bytes.
Experiment on Oracle:
Inserted 10,000 rows into a dummy table with two clob columns. The first column
holds 4000 bytes per row, while the other holds 4001 bytes.
Avg query time for retrieving 10,000 rows with col of 4000 bytes: ~1.5 seconds
Avg query time for retrieving 10,000 rows with col of 4001 bytes: ~4.5 seconds
So retrieval is about three times faster when the col size is less than or
equal to 4000 bytes.
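To make the arithmetic concrete, the sketch below recomputes the compression
factors from the sizes reported above and checks each compressed size against
the inline limits quoted in this comment. The class and method names are
hypothetical, and the limits are the figures stated here, not values read
from any database.

```java
// Hypothetical helper, not part of Oozie: works only with the numbers
// reported in this comment.
class InlineLobCheck {

    // Inline storage limits as quoted in this comment.
    static final int ORACLE_INLINE_LIMIT_BYTES = 4000;
    static final int MYSQL_INLINE_LIMIT_BYTES = 768;

    // Factor by which compression shrinks the data.
    static double compressionFactor(long originalBytes, long compressedBytes) {
        return (double) originalBytes / compressedBytes;
    }

    // True if a value of this size can be stored inline under the limit.
    static boolean fitsInline(long sizeBytes, int limitBytes) {
        return sizeBytes <= limitBytes;
    }

    public static void main(String[] args) {
        long[][] samples = {
            {1590438L, 30918L},  // missing_deps: original, compressed
            {230379L, 18233L},   // wf_instance: original, compressed
        };
        String[] names = {"missing_deps", "wf_instance"};

        for (int i = 0; i < samples.length; i++) {
            long original = samples[i][0];
            long compressed = samples[i][1];
            System.out.println(names[i]
                + " factor=" + compressionFactor(original, compressed)
                + " fitsOracleInline="
                + fitsInline(compressed, ORACLE_INLINE_LIMIT_BYTES)
                + " fitsMysqlInline="
                + fitsInline(compressed, MYSQL_INLINE_LIMIT_BYTES));
        }
    }
}
```

Note that both measured values are the largest in their columns and remain
above 4000 bytes even after compression, so the inline benefit would apply
only to rows whose compressed size falls under the limit.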
> Compress lob columns before storing in database
> -----------------------------------------------
>
> Key: OOZIE-1462
> URL: https://issues.apache.org/jira/browse/OOZIE-1462
> Project: Oozie
> Issue Type: Improvement
> Reporter: Virag Kothari
> Assignee: Virag Kothari
>
> Storing huge data in lobs is very inefficient. Making Oozie compress the data
> before storing will reduce size of data to be stored in lobs and help in
> reducing the time for queries. Also most databases like oracle, mysql support
> storing lob data in tablerow (inline) if the data is of smaller size. Inline
> storage has much better performance compared to outline storage (storage
> outside of tablerow)