[ https://issues.apache.org/jira/browse/ORC-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wan Kun updated ORC-1986:
-------------------------
    Description: 
When input rows are large, a stripe can become very large and require a lot of memory to read and write. We can check the tree writer size in bytes and flush the stripe even when the input row count is less than 5000.
{code:java}
Stripes:
  Stripe: offset: 3 data: 347494188 rows: 5120 tail: 244 index: 2304
    Stream: column 0 section ROW_INDEX start: 3 length 12
    Stream: column 1 section ROW_INDEX start: 15 length 110
    Stream: column 2 section ROW_INDEX start: 125 length 893
    Stream: column 3 section ROW_INDEX start: 1018 length 31
    Stream: column 4 section ROW_INDEX start: 1049 length 65
    Stream: column 5 section ROW_INDEX start: 1114 length 923
    Stream: column 6 section ROW_INDEX start: 2037 length 25
    Stream: column 7 section ROW_INDEX start: 2062 length 155
    Stream: column 8 section ROW_INDEX start: 2217 length 28
    Stream: column 9 section ROW_INDEX start: 2245 length 31
    Stream: column 10 section ROW_INDEX start: 2276 length 31
    Stream: column 1 section DATA start: 2307 length 81853
    Stream: column 1 section LENGTH start: 84160 length 2191
    Stream: column 2 section DATA start: 86351 length 345862763
    Stream: column 2 section LENGTH start: 345949114 length 13736
    Stream: column 3 section DATA start: 345962850 length 22
    Stream: column 3 section LENGTH start: 345962872 length 6
    Stream: column 3 section DICTIONARY_DATA start: 345962878 length 5
    Stream: column 4 section PRESENT start: 345962883 length 200
    Stream: column 4 section DATA start: 345963083 length 6322
    Stream: column 4 section LENGTH start: 345969405 length 495
    Stream: column 4 section DICTIONARY_DATA start: 345969900 length 2919
    Stream: column 5 section DATA start: 345972819 length 1507883
    Stream: column 5 section LENGTH start: 347480702 length 7346
    Stream: column 6 section DATA start: 347488048 length 22
    Stream: column 6 section LENGTH start: 347488070 length 6
    Stream: column 6 section DICTIONARY_DATA start: 347488076 length 0
    Stream: column 7 section DATA start: 347488076 length 5795
    Stream: column 7 section LENGTH start: 347493871 length 301
    Stream: column 7 section DICTIONARY_DATA start: 347494172 length 2187
    Stream: column 8 section DATA start: 347496359 length 22
    Stream: column 8 section LENGTH start: 347496381 length 6
    Stream: column 8 section DICTIONARY_DATA start: 347496387 length 4
    Stream: column 9 section DATA start: 347496391 length 58
    Stream: column 9 section LENGTH start: 347496449 length 6
    Stream: column 9 section DICTIONARY_DATA start: 347496455 length 7
    Stream: column 10 section DATA start: 347496462 length 22
    Stream: column 10 section LENGTH start: 347496484 length 6
    Stream: column 10 section DICTIONARY_DATA start: 347496490 length 5
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2
    Encoding column 3: DICTIONARY_V2[1]
    Encoding column 4: DICTIONARY_V2[661]
    Encoding column 5: DIRECT_V2
    Encoding column 6: DICTIONARY_V2[1]
    Encoding column 7: DICTIONARY_V2[682]
    Encoding column 8: DICTIONARY_V2[1]
    Encoding column 9: DICTIONARY_V2[2]
    Encoding column 10: DICTIONARY_V2[1]
{code}

  was:
For large input rows, the stripe may very large and needs more memory to read and write each strip, we can check the tree write size in bytes and flush the strip enen the input rows count is less than 5000.

> Trigger flush stripe for large input rows
> -----------------------------------------
>
>                 Key: ORC-1986
>                 URL: https://issues.apache.org/jira/browse/ORC-1986
>             Project: ORC
>          Issue Type: Improvement
>            Reporter: Wan Kun
>            Priority: Major
>
> When input rows are large, a stripe can become very large and require a lot
> of memory to read and write. We can check the tree writer size in bytes and
> flush the stripe even when the input row count is less than 5000.


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
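
The proposal in the description can be illustrated with a small simulation. This is only a sketch under stated assumptions, not ORC's writer code: the class `StripeFlushSketch`, the method `simulate`, and all of its parameters are hypothetical names introduced here. It takes the figures from the dump above (347494188 data bytes over 5120 rows, about 67869 bytes per row) and assumes a 64 MiB stripe size limit. It also checks row by row, whereas a real writer would most likely check per batch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from ORC.
class StripeFlushSketch {

    /**
     * Simulates writing `rows` rows of `rowBytes` bytes each and returns
     * how many rows end up in each stripe.
     *
     * @param stripeLimit        assumed stripe size limit in bytes
     * @param checkInterval      rows between size checks (5000 in the description)
     * @param checkBytesEveryRow if true, the buffered size is checked after
     *                           every row instead of every checkInterval rows
     */
    public static List<Integer> simulate(int rows, long rowBytes, long stripeLimit,
                                         int checkInterval, boolean checkBytesEveryRow) {
        List<Integer> stripes = new ArrayList<>();
        long bufferedBytes = 0;
        int rowsInStripe = 0;
        int rowsSinceCheck = 0;
        for (int i = 0; i < rows; i++) {
            bufferedBytes += rowBytes;
            rowsInStripe++;
            rowsSinceCheck++;
            boolean timeToCheck = checkBytesEveryRow || rowsSinceCheck >= checkInterval;
            if (timeToCheck) {
                rowsSinceCheck = 0;
                if (bufferedBytes >= stripeLimit) {
                    // Flush: record the stripe and start a new one.
                    stripes.add(rowsInStripe);
                    bufferedBytes = 0;
                    rowsInStripe = 0;
                }
            }
        }
        if (rowsInStripe > 0) {
            stripes.add(rowsInStripe); // final partial stripe written on close
        }
        return stripes;
    }

    public static void main(String[] args) {
        long rowBytes = 347494188L / 5120;    // 67869 bytes per row on average
        long stripeLimit = 64L * 1024 * 1024; // assumed 64 MiB limit

        // Size checked only every 5000 rows: the first stripe holds ~339 MB.
        System.out.println(simulate(5120, rowBytes, stripeLimit, 5000, false));
        // prints [5000, 120]

        // Size checked after every row: stripes stay near the 64 MiB limit.
        System.out.println(simulate(5120, rowBytes, stripeLimit, 5000, true));
        // prints [989, 989, 989, 989, 989, 175]
    }
}
```

With the byte check enabled, each stripe in this simulation closes at 989 rows (about 67.1 MB) instead of growing to roughly 339 MB before the first row-count check.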