ongdisheng opened a new pull request, #37:
URL: https://github.com/apache/asterixdb/pull/37

   ## Description
   There are two bugs in `writeUTF8StringAsCSV` in 
[PrintTools.java](https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-om/src/main/java/org/apache/asterix/dataflow/data/nontagged/printers/PrintTools.java#L305C4-L305C12):
   1. Incorrect loop step in the quoting scan:
   The loop that checks whether a string needs quoting advanced one byte at a time (`i++`). Because multi-byte characters span 2 to 4 bytes, `charAt()` could be called at an offset in the middle of a character.
   
   2. Incorrect character writing:
   Characters were written using `PrintStream.print(char)`, which encodes each `char` through the stream's charset (the platform default unless one is specified) before writing. An emoji is represented as a surrogate pair, and a lone surrogate half is not a valid Unicode character on its own, so `PrintStream` emitted replacement characters (`?`) instead of the correct UTF-8 bytes.
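
The second bug can be illustrated outside AsterixDB. The sketch below is not the `PrintTools` code: it uses `String.getBytes` as a stand-in for a per-`char` encoding step, and the class and method names are hypothetical. It compares encoding a surrogate pair one `char` at a time against encoding the whole string.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of AsterixDB.
final class SurrogateEncodingSketch {

    // Encodes one UTF-16 code unit at a time, so a surrogate pair is split
    // into two lone surrogates before each one reaches the encoder.
    static byte[] encodePerChar(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            byte[] encoded = String.valueOf(s.charAt(i)).getBytes(StandardCharsets.UTF_8);
            out.write(encoded, 0, encoded.length);
        }
        return out.toByteArray();
    }

    // Encodes the whole string at once, so surrogate pairs stay together.
    static byte[] encodeWhole(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String flexedBiceps = "\uD83D\uDCAA"; // U+1F4AA: one code point, two chars
        System.out.println(encodePerChar(flexedBiceps).length); // 2, each half became '?'
        System.out.println(encodeWhole(flexedBiceps).length);   // 4, the real UTF-8 bytes
    }
}
```

Each lone surrogate is replaced by the encoder's substitute byte (`?`), which is why writing the stored UTF-8 bytes directly avoids the problem.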
   
   ## Fix
   The quoting scan loop now advances by `UTF8StringUtil.charSize()` per iteration, so `charAt()` is always called at a valid character boundary. Characters are now written directly as raw UTF-8 bytes, which is consistent with how `writeUTF8StringAsJSON` already handles the same data.
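
A minimal sketch of the boundary-aware scan, for illustration only. It assumes standard UTF-8 input; `charSize`, `needsQuoting`, and `countChars` here are hypothetical helpers written for this example, not the `UTF8StringUtil` API, and the set of characters that trigger quoting (comma, double quote, CR, LF) is an assumption.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of AsterixDB.
final class Utf8BoundaryScanSketch {

    // Byte length of the UTF-8 character that starts with the given lead byte.
    // Rejects continuation bytes, i.e. offsets in the middle of a character.
    static int charSize(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;
        if ((b & 0xE0) == 0xC0) return 2;
        if ((b & 0xF0) == 0xE0) return 3;
        if ((b & 0xF8) == 0xF0) return 4;
        throw new IllegalArgumentException("not a lead byte: 0x" + Integer.toHexString(b));
    }

    // Looks for characters that force CSV quoting, stepping one whole
    // character per iteration so only lead bytes are inspected.
    static boolean needsQuoting(byte[] utf8) {
        for (int i = 0; i < utf8.length; i += charSize(utf8[i])) {
            byte b = utf8[i];
            if (b == ',' || b == '"' || b == '\r' || b == '\n') {
                return true;
            }
        }
        return false;
    }

    // Counts characters by visiting lead bytes only.
    static int countChars(byte[] utf8) {
        int count = 0;
        for (int i = 0; i < utf8.length; i += charSize(utf8[i])) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] text = "No more \uD83D\uDCAA\uD83E\uDD8B".getBytes(StandardCharsets.UTF_8);
        System.out.println(text.length);        // 16 bytes
        System.out.println(countChars(text));   // 10 characters
        System.out.println(needsQuoting(text)); // false
    }
}
```

In this sketch, calling `charSize` on a byte in the middle of an emoji (for example `text[9]`) throws, which is the kind of failure a byte-at-a-time step can run into.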
   
   ## How to Reproduce and Verify
   
   <details>
   <summary>Setup</summary>
   
   ```bash
   disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode 'statement=
   DROP DATAVERSE test IF EXISTS;
   CREATE DATAVERSE test;
   USE test;
   CREATE TYPE TweetType AS { id: int, text: string };
   CREATE DATASET tweets(TweetType) PRIMARY KEY id;
   ' "http://localhost:19002/query/service";
   
   disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode 'statement=
   USE test;                                                       
   INSERT INTO tweets ({"id": 1, "text": "@ScapegoatHelp Walked out on being scapegoated again. Saw Narcs mask slip & that sneer. No more 💪🦋"});
   ' "http://localhost:19002/query/service";
   
   ```
   
   </details>
   
   <details>
   <summary>Before fix</summary>
   
   ```bash
   disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode "statement=USE test; SELECT text FROM tweets;" "http://localhost:19002/query/service";
   {
           "requestID": "3056c5df-fb14-4ff8-90b6-dcc62662a563",
           "signature": {
                   "*": "*"
           },
           "results": [ {"text":"@ScapegoatHelp Walked out on being scapegoated again. Saw Narcs mask slip & that sneer. No more 💪🦋"} ]
           ,
           "plans":{},
           "status": "success",
           "metrics": {
                   "elapsedTime": "28.199702ms",
                   "executionTime": "26.761674ms",
                   "compileTime": "10.872533ms",
                   "queueWaitTime": "0ns",
                   "resultCount": 1,
                   "resultSize": 111,
                   "processedObjects": 1
           }
   }
   disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode "statement=USE test; SELECT text FROM tweets;" "http://localhost:19002/query/service?format=csv&header=absent";
   {
           "requestID": "6bbf0b3d-05a3-43e5-ad61-d3ebeee20e5e",
           "type": "text/csv; header=absent",
           "signature": {
                   "*": "*"
           },
           "errors": [{ 
                   "code": 1,              "msg": "java.lang.IllegalArgumentException" }
           ],
           "status": "fatal",
           "metrics": {
                   "elapsedTime": "26.795447ms",
                   "executionTime": "25.588135ms",
                   "compileTime": "11.320646ms",
                   "queueWaitTime": "0ns",
                   "resultCount": 0,
                   "resultSize": 0,
                   "processedObjects": 0,
                   "bufferCacheHitRatio": "0.00%",
                   "bufferCachePageReadCount": 0,
                   "errorCount": 1
           }
   }
   ```
   </details>
   
   
   <details>
   <summary>After fix</summary>
   
   ```bash
   disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode "statement=USE test; SELECT text FROM tweets;" \
        "http://localhost:19002/query/service";
   {
           "requestID": "6f404f34-1726-42d7-ba7a-990d9b08cd0d",
           "signature": {
                   "*": "*"
           },
           "results": [ {"text":"@ScapegoatHelp Walked out on being scapegoated again. Saw Narcs mask slip & that sneer. No more 💪🦋"} ]
           ,
           "plans":{},
           "status": "success",
           "metrics": {
                   "elapsedTime": "138.660107ms",
                   "executionTime": "134.309116ms",
                   "compileTime": "43.763632ms",
                   "queueWaitTime": "0ns",
                   "resultCount": 1,
                   "resultSize": 111,
                   "processedObjects": 1
           }
   }
   disheng@LAPTOP-UPFH5KC9:~/asterixdb$ curl --data-urlencode "statement=USE test; SELECT text FROM tweets;" \
        "http://localhost:19002/query/service?format=csv&header=absent";
   {
           "requestID": "855ebc1a-6816-49ef-bbcb-7e99fe6e5ef0",
           "type": "text/csv; header=absent",
           "signature": {
                   "*": "*"
           },
           "results": [ "@ScapegoatHelp Walked out on being scapegoated again. Saw Narcs mask slip & that sneer. No more 💪🦋" ]
           ,
           "plans":{},
           "status": "success",
           "metrics": {
                   "elapsedTime": "34.516791ms",
                   "executionTime": "32.945963ms",
                   "compileTime": "11.91599ms",
                   "queueWaitTime": "1ms",
                   "resultCount": 1,
                   "resultSize": 102,
                   "processedObjects": 1
           }
   }
   ```
   
   </details>
   
   ## JIRA Issue 
   https://issues.apache.org/jira/browse/ASTERIXDB-2877

