spaces-X opened a new pull request, #9436: URL: https://github.com/apache/incubator-doris/pull/9436
# Proposed changes

The `row_number()` function in **Spark** returns **an integer-typed value**, which causes two problems in Spark Load.

**Case 1: loading a large amount of data at one time causes `row_number()` to overflow.** When the cardinality of the columns to be encoded in the **data imported at one time** exceeds about **2.1 billion** (the 32-bit integer maximum, 2,147,483,647), `row_number()` wraps around and returns a negative number.

**Case 2: loading data in multiple batches causes the maximum `dict_value` in the global dictionary to exceed the integer range, but we do not cast it to `bigint`.**

---

For case 1, I think it is a design flaw that creates a bottleneck for one-shot loading. Since case 1 arises in relatively few scenarios, it can be worked around in the short term by importing the data in multiple batches. Case 2 is solved by this PR.

## Problem Summary:

Describe the overview of changes.

## Checklist(Required)

1. Does it affect the original behavior: (Yes/No/I Don't know)
2. Has unit tests been added: (Yes/No/No Need)
3. Has document been added or modified: (Yes/No/No Need)
4. Does it need to update dependencies: (Yes/No)
5. Are there any changes that cannot be rolled back: (Yes/No)

## Further comments

If this is a relatively large or complex change, kick off the discussion at [[email protected]](mailto:[email protected]) by explaining why you chose the solution you did and what alternatives you considered, etc...
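The overflow in case 1 is ordinary 32-bit integer wraparound: Spark's `row_number()` yields an `IntegerType`, so once more than 2,147,483,647 distinct values need dictionary codes, the counter wraps negative. A minimal sketch of that wraparound in plain Python, simulating 32-bit arithmetic with `ctypes` (the helper name `to_int32` is illustrative, not from the PR):

```python
import ctypes

INT32_MAX = 2_147_483_647  # upper bound of Spark's IntegerType / row_number()

def to_int32(v: int) -> int:
    """Truncate a Python int to signed 32-bit, mimicking integer overflow."""
    return ctypes.c_int32(v).value

print(to_int32(INT32_MAX))      # 2147483647 -- last valid dict value
print(to_int32(INT32_MAX + 1))  # -2147483648 -- wraps negative, as in case 1
```

The same arithmetic explains case 2: even when each batch stays under the limit, the running maximum `dict_value` accumulated across batches can cross it, so the column holding it must be a 64-bit `bigint` rather than a 32-bit integer.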
