[ https://issues.apache.org/jira/browse/HIVE-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973566#comment-13973566 ]
Jason Dere commented on HIVE-6843:
----------------------------------

Should this also work for Unicode characters which require more than one Java character? If you add these checks to TestGenericUDFUtils, the 2nd check fails:
{code}
Assert.assertEquals(3, GenericUDFUtils.findText(new Text("123\uD801\uDC00456"), new Text("\uD801\uDC00"), 0));
Assert.assertEquals(4, GenericUDFUtils.findText(new Text("123\uD801\uDC00456"), new Text("4"), 0));
{code}
This would require using String.codePointCount() on the indexOf() result.

> INSTR for UTF-8 returns incorrect position
> ------------------------------------------
>
>                 Key: HIVE-6843
>                 URL: https://issues.apache.org/jira/browse/HIVE-6843
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>    Affects Versions: 0.11.0, 0.12.0
>            Reporter: Clif Kranish
>            Assignee: Szehon Ho
>            Priority: Minor
>         Attachments: HIVE-6843.patch
>
>

--
This message was sent by Atlassian JIRA
(v6.2#6252)
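
For reference, the codePointCount() suggestion in the comment above could look roughly like the sketch below. It uses plain java.lang.String rather than Hadoop's Text, and the class CodePointIndex and the method findCodePointIndex are hypothetical names chosen for illustration, not Hive's GenericUDFUtils.findText. It also assumes that the start argument is a UTF-16 char offset, which may not match what the real method expects.

```java
// Hypothetical illustration only; this is not the Hive implementation.
class CodePointIndex {

    // Returns the position of `sub` in `text` counted in Unicode code points,
    // or -1 if `sub` does not occur. `start` is assumed to be a UTF-16 char offset.
    static int findCodePointIndex(String text, String sub, int start) {
        // indexOf() reports the position in UTF-16 chars, so a supplementary
        // character (one surrogate pair) before the match counts twice.
        int charIndex = text.indexOf(sub, start);
        if (charIndex < 0) {
            return -1;
        }
        // Convert the char offset to a code point offset.
        return text.codePointCount(0, charIndex);
    }

    public static void main(String[] args) {
        // "123", then U+10400 (stored as a surrogate pair), then "456"
        String text = "123\uD801\uDC00456";

        System.out.println(text.indexOf("4"));                            // 5 (char offset)
        System.out.println(findCodePointIndex(text, "4", 0));             // 4 (code point offset)
        System.out.println(findCodePointIndex(text, "\uD801\uDC00", 0));  // 3
    }
}
```

With this conversion both checks from the comment would see the expected values (3 and 4), whereas the raw indexOf() result for "4" is 5 because the surrogate pair occupies two chars.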