[ 
https://issues.apache.org/jira/browse/HADOOP-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Ratan updated HADOOP-1758:
--------------------------------

    Attachment: 1758_01.patch

Attached patch (1758_01.patch). while this fix is much more efficient than the 
current code, it is not the most efficient. We read data from a stream into a 
temporary string, then append to our string. A more efficient solution would be 
as EricW provided in an earlier comment. However, the solution in the patch is 
only slightly more inefficient, but is a lot simpler to read and understand, 
and it also mirrors the Java code. These are both important, IMO. If ever an 
extra string copy does indeed cause performance problems, we can change the 
code. 

> processing escapes in a jute record is quadratic
> ------------------------------------------------
>
>                 Key: HADOOP-1758
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1758
>             Project: Hadoop
>          Issue Type: Bug
>          Components: record
>    Affects Versions: 0.13.0
>            Reporter: Dick King
>            Assignee: Vivek Ratan
>            Priority: Blocker
>             Fix For: 0.15.0
>
>         Attachments: 1758_01.patch
>
>
> The following code appears in hadoop/src/c++/librecordio/csvarchive.cc :
> static void replaceAll(std::string s, const char *src, char c)
> {
>   std::string::size_type pos = 0;
>   while (pos != std::string::npos) {
>     pos = s.find(src);
>     if (pos != std::string::npos) {
>       s.replace(pos, strlen(src), 1, c);
>     }
>   }
> }
> This is used in the context of replacing jute escapes in the code:
> void hadoop::ICsvArchive::deserialize(std::string& t, const char* tag)
> {
>   t = readUptoTerminator(stream);
>   if (t[0] != '\'') {
>     throw new IOException("Errror deserializing string.");
>   }
>   t.erase(0, 1); /// erase first character
>   replaceAll(t, "%0D", 0x0D);
>   replaceAll(t, "%0A", 0x0A);
>   replaceAll(t, "%7D", 0x7D);
>   replaceAll(t, "%00", 0x00);
>   replaceAll(t, "%2C", 0x2C);
>   replaceAll(t, "%25", 0x25);
> }
> Since this replaces the entire string for each instance of the escape 
> sequence, practically anything would be better.  I would propose that within 
> deserialize we allocate a char * [since each replacement is smaller than the 
> original], scan for each %, and either do a general hex conversion in place 
> or look for one of the six patterns, and after each replacement move down the 
> unmodified text and scan for the % fom that starting point.
> -dk

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to