[ https://issues.apache.org/jira/browse/HADOOP-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521950 ]
Milind Bhandarkar commented on HADOOP-1758:
-------------------------------------------

Dick, EricW provided this replacement for hadoop::ICsvArchive::deserialize, which he said worked great for him. Can you try it out?

{code}
void hadoop::ICsvArchive::deserialize(std::string& t, const char* tag)
{
  char c;
  if (1 != stream.read(&c, 1)) {
    throw new IOException("Error in deserialization.");
  }
  if (c != '\'') {
    throw new IOException("Error deserializing string.");
  }
  while (1) {
    char c;
    if (1 != stream.read(&c, 1)) {
      throw new IOException("Error in deserialization.");
    }
    if (c == ',' || c == '\n' || c == '}') {
      if (c != ',') {
        stream.pushBack(c);
      }
      break;
    } else if (c == '%') {
      char d[2];
      if (2 != stream.read(d, 2)) {
        throw new IOException("Error in deserialization.");
      }
      if (strncmp(d, "0D", 2) == 0) {
        t.push_back(0x0D);
      } else if (strncmp(d, "0A", 2) == 0) {
        t.push_back(0x0A);
      } else if (strncmp(d, "7D", 2) == 0) {
        t.push_back(0x7D);
      } else if (strncmp(d, "00", 2) == 0) {
        t.push_back(0x00);
      } else if (strncmp(d, "2C", 2) == 0) {
        t.push_back(0x2C);
      } else if (strncmp(d, "25", 2) == 0) {
        t.push_back(0x25);
      } else {
        // Unrecognized escape: pass the '%' and both following characters through.
        t.push_back(c);
        t.push_back(d[0]);
        t.push_back(d[1]);
      }
    } else {
      t.push_back(c);
    }
  }
}
{code}

> processing escapes in a jute record is quadratic
> ------------------------------------------------
>
>                 Key: HADOOP-1758
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1758
>             Project: Hadoop
>          Issue Type: Bug
>          Components: record
>    Affects Versions: 0.13.0
>            Reporter: Dick King
>            Priority: Blocker
>
> The following code appears in hadoop/src/c++/librecordio/csvarchive.cc :
>
> static void replaceAll(std::string s, const char *src, char c)
> {
>   std::string::size_type pos = 0;
>   while (pos != std::string::npos) {
>     pos = s.find(src);
>     if (pos != std::string::npos) {
>       s.replace(pos, strlen(src), 1, c);
>     }
>   }
> }
>
> This is used in the context of replacing jute escapes in the code:
>
> void hadoop::ICsvArchive::deserialize(std::string& t, const char* tag)
> {
>   t
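For comparison, here is a minimal standalone sketch of the same single-pass idea applied to an in-memory string rather than a stream — one forward scan that decodes each %XX escape in place, so the whole unescape is linear in the input length. The function name (csvUnescape) and the general-hex decoding are illustrative assumptions, not part of librecordio:

```cpp
#include <cassert>
#include <string>

// Illustrative helper: value of one hex digit, or -1 if not a hex digit.
static int hexVal(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  return -1;
}

// Hypothetical single-pass unescape: each character is examined exactly once,
// and output is appended to a pre-reserved buffer, so the cost is O(n).
std::string csvUnescape(const std::string& s) {
  std::string out;
  out.reserve(s.size());  // the decoded string is never longer than the input
  for (std::string::size_type i = 0; i < s.size(); ++i) {
    if (s[i] == '%' && i + 2 < s.size()) {
      int hi = hexVal(s[i + 1]);
      int lo = hexVal(s[i + 2]);
      if (hi >= 0 && lo >= 0) {
        out.push_back(static_cast<char>(hi * 16 + lo));
        i += 2;  // consume the two hex digits as well
        continue;
      }
    }
    out.push_back(s[i]);  // ordinary character: copy through unchanged
  }
  return out;
}
```

This decodes any %XX sequence generically, which covers the six escapes (%0D, %0A, %7D, %00, %2C, %25) as special cases.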
>     = readUptoTerminator(stream);
>   if (t[0] != '\'') {
>     throw new IOException("Errror deserializing string.");
>   }
>   t.erase(0, 1); /// erase first character
>   replaceAll(t, "%0D", 0x0D);
>   replaceAll(t, "%0A", 0x0A);
>   replaceAll(t, "%7D", 0x7D);
>   replaceAll(t, "%00", 0x00);
>   replaceAll(t, "%2C", 0x2C);
>   replaceAll(t, "%25", 0x25);
> }
>
> Since this replaces the entire string for each instance of the escape sequence, practically anything would be better. I would propose that within deserialize we allocate a char * [since each replacement is smaller than the original], scan for each %, and either do a general hex conversion in place or look for one of the six patterns, and after each replacement move down the unmodified text and scan for the % from that starting point.
>
> -dk

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.