processing escapes in a jute record is quadratic
------------------------------------------------
Key: HADOOP-1758
URL: https://issues.apache.org/jira/browse/HADOOP-1758
Project: Hadoop
Issue Type: Bug
Components: record
Affects Versions: 0.13.0
Reporter: Dick King
Priority: Blocker
The following code appears in hadoop/src/c++/librecordio/csvarchive.cc:
static void replaceAll(std::string s, const char *src, char c)
{
    std::string::size_type pos = 0;
    while (pos != std::string::npos) {
        pos = s.find(src);
        if (pos != std::string::npos) {
            s.replace(pos, strlen(src), 1, c);
        }
    }
}
It is used to replace the jute escape sequences during string deserialization:
void hadoop::ICsvArchive::deserialize(std::string& t, const char* tag)
{
    t = readUptoTerminator(stream);
    if (t[0] != '\'') {
        throw new IOException("Errror deserializing string.");
    }
    t.erase(0, 1); /// erase first character
    replaceAll(t, "%0D", 0x0D);
    replaceAll(t, "%0A", 0x0A);
    replaceAll(t, "%7D", 0x7D);
    replaceAll(t, "%00", 0x00);
    replaceAll(t, "%2C", 0x2C);
    replaceAll(t, "%25", 0x25);
}
Since each call to replaceAll re-scans the string from the beginning for every
instance of its escape sequence, the total cost is quadratic in the length of
the string; practically anything would be better. I would propose that within
deserialize we allocate a char * [since each replacement is smaller than the
original], scan for each %, and either do a general hex conversion in place or
match one of the six patterns, and after each replacement shift the unmodified
text down and resume scanning for the % from that starting point.
-dk
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.