[jira] [Updated] (ORC-1986) Trigger flush stripe for large input rows

Wan Kun (Jira) Wed, 03 Sep 2025 03:22:18 -0700


     [ 
https://issues.apache.org/jira/browse/ORC-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Wan Kun updated ORC-1986:
-------------------------
    Description: 
For large input rows, the stripe may be very large and need more memory to read 
and write each stripe. We can check the tree write size in bytes and flush the 
stripe even when the input row count is less than 5000.

{code:java}
Stripes:
  Stripe: offset: 3 data: 347494188 rows: 5120 tail: 244 index: 2304
    Stream: column 0 section ROW_INDEX start: 3 length 12
    Stream: column 1 section ROW_INDEX start: 15 length 110
    Stream: column 2 section ROW_INDEX start: 125 length 893
    Stream: column 3 section ROW_INDEX start: 1018 length 31
    Stream: column 4 section ROW_INDEX start: 1049 length 65
    Stream: column 5 section ROW_INDEX start: 1114 length 923
    Stream: column 6 section ROW_INDEX start: 2037 length 25
    Stream: column 7 section ROW_INDEX start: 2062 length 155
    Stream: column 8 section ROW_INDEX start: 2217 length 28
    Stream: column 9 section ROW_INDEX start: 2245 length 31
    Stream: column 10 section ROW_INDEX start: 2276 length 31
    Stream: column 1 section DATA start: 2307 length 81853
    Stream: column 1 section LENGTH start: 84160 length 2191
    Stream: column 2 section DATA start: 86351 length 345862763
    Stream: column 2 section LENGTH start: 345949114 length 13736
    Stream: column 3 section DATA start: 345962850 length 22
    Stream: column 3 section LENGTH start: 345962872 length 6
    Stream: column 3 section DICTIONARY_DATA start: 345962878 length 5
    Stream: column 4 section PRESENT start: 345962883 length 200
    Stream: column 4 section DATA start: 345963083 length 6322
    Stream: column 4 section LENGTH start: 345969405 length 495
    Stream: column 4 section DICTIONARY_DATA start: 345969900 length 2919
    Stream: column 5 section DATA start: 345972819 length 1507883
    Stream: column 5 section LENGTH start: 347480702 length 7346
    Stream: column 6 section DATA start: 347488048 length 22
    Stream: column 6 section LENGTH start: 347488070 length 6
    Stream: column 6 section DICTIONARY_DATA start: 347488076 length 0
    Stream: column 7 section DATA start: 347488076 length 5795
    Stream: column 7 section LENGTH start: 347493871 length 301
    Stream: column 7 section DICTIONARY_DATA start: 347494172 length 2187
    Stream: column 8 section DATA start: 347496359 length 22
    Stream: column 8 section LENGTH start: 347496381 length 6
    Stream: column 8 section DICTIONARY_DATA start: 347496387 length 4
    Stream: column 9 section DATA start: 347496391 length 58
    Stream: column 9 section LENGTH start: 347496449 length 6
    Stream: column 9 section DICTIONARY_DATA start: 347496455 length 7
    Stream: column 10 section DATA start: 347496462 length 22
    Stream: column 10 section LENGTH start: 347496484 length 6
    Stream: column 10 section DICTIONARY_DATA start: 347496490 length 5
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2
    Encoding column 3: DICTIONARY_V2[1]
    Encoding column 4: DICTIONARY_V2[661]
    Encoding column 5: DIRECT_V2
    Encoding column 6: DICTIONARY_V2[1]
    Encoding column 7: DICTIONARY_V2[682]
    Encoding column 8: DICTIONARY_V2[1]
    Encoding column 9: DICTIONARY_V2[2]
    Encoding column 10: DICTIONARY_V2[1]
{code}
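
The proposal above can be sketched as a small simulation. This is only an illustration under stated assumptions, not the ORC writer's actual code: the class and method names are hypothetical, the 64 MiB stripe limit and 1024-row batches are assumed values, and the average row size is derived from the stripe dump above (347494188 data bytes over 5120 rows). The 5000-row check interval is the figure from the description.

```java
// Hypothetical sketch; none of these names are ORC APIs.
class StripeFlushSketch {

    /**
     * Simulates buffering fixed-size rows in batches and returns the number of
     * rows buffered at the moment the stripe would be flushed.
     *
     * @param checkBytesEveryBatch false: the byte size is only inspected once
     *        rowsBetweenChecks rows have accumulated; true: the byte size is
     *        inspected after every batch (the behaviour proposed in the issue).
     */
    static long rowsAtFlush(long bytesPerRow, int batchRows, long stripeLimitBytes,
                            long rowsBetweenChecks, boolean checkBytesEveryBatch) {
        long rows = 0;
        long bytes = 0;
        long rowsSinceCheck = 0;
        while (true) {
            rows += batchRows;
            bytes += bytesPerRow * batchRows;
            rowsSinceCheck += batchRows;

            // The row-count based check becomes due only after enough rows.
            boolean rowCheckDue = rowsSinceCheck >= rowsBetweenChecks;
            if (rowCheckDue) {
                rowsSinceCheck = 0;
            }
            // Flush when a size check runs and the buffered bytes exceed the limit.
            if ((rowCheckDue || checkBytesEveryBatch) && bytes >= stripeLimitBytes) {
                return rows;
            }
        }
    }

    public static void main(String[] args) {
        long bytesPerRow = 347494188L / 5120;   // 67869 bytes per row on average
        long stripeLimit = 64L * 1024 * 1024;   // assumed 64 MiB stripe limit
        int batchRows = 1024;                   // assumed batch size

        // Size checked only every 5000 rows: the stripe grows to 5120 rows.
        System.out.println(rowsAtFlush(bytesPerRow, batchRows, stripeLimit, 5000, false)); // prints 5120
        // Size checked after every batch: the stripe is flushed at 1024 rows.
        System.out.println(rowsAtFlush(bytesPerRow, batchRows, stripeLimit, 5000, true));  // prints 1024
    }
}
```

Under these assumptions the row-count-only check reproduces the 5120-row stripe shown in the dump, while the byte-based check flushes after the first batch, since 1024 rows of about 66 KiB each already exceed the assumed limit.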


  was:
For large input rows, the stripe may very large and needs more memory to read 
and write each strip, we can check the tree write size in bytes and flush the 
strip enen the input rows count is less than 5000.




> Trigger flush stripe for large input rows
> -----------------------------------------
>
>                 Key: ORC-1986
>                 URL: https://issues.apache.org/jira/browse/ORC-1986
>             Project: ORC
>          Issue Type: Improvement
>            Reporter: Wan Kun
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
