Hi Charles,
Very cool plugin!
My knowledge in this area is a bit sketchy… That said, the problem appears to
be that the code never grows the DrillBuf, so writes past its initial capacity
fail. Try calling reallocIfNeeded() before each write, something like this:
// reallocIfNeeded returns the (possibly reallocated) buffer, so keep the result.
this.buffer = this.buffer.reallocIfNeeded(stringLength);
this.buffer.setBytes(0, bytes, 0, stringLength);
map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
Then, comment out the 256 length hack and see if it works.
To avoid memory fragmentation, maybe restructure your loop along these lines:
int maxRecords = MAX_RECORDS_PER_BATCH;
int maxWidth = 256;
while (recordCount < maxRecords && (line = this.reader.readLine()) != null) {
  …
  // Track the widest value seen so far and shrink the batch limit
  // so that no single vector grows past the 16 MB target.
  if (stringLength > maxWidth) {
    maxWidth = stringLength;
    maxRecords = 16 * 1024 * 1024 / maxWidth;
  }
  …
}
The above is not perfect: the last record added might be much larger than the
others, causing the corresponding vector to grow past 16 MB, but the
occasional oversized vector should be OK.
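Putting the two changes together with the snippet from your earlier mail, the
per-field write might look roughly like this. This is only a sketch: buffer,
map, recordCount, and MAX_RECORDS_PER_BATCH are assumed to be your plugin’s
existing fields, and writeString() is just an illustrative helper, not a
Drill API:
// Sketch: assumes buffer (DrillBuf) and map (the MapWriter) already exist.
private int maxRecords = MAX_RECORDS_PER_BATCH;
private int maxWidth = 256;

private void writeString(String fieldName, String fieldValue) {
  // StandardCharsets.UTF_8 avoids the checked exception of getBytes("UTF-8").
  byte[] bytes = fieldValue.getBytes(java.nio.charset.StandardCharsets.UTF_8);
  int stringLength = bytes.length;
  // A wider value means fewer rows fit under the 16 MB vector target.
  if (stringLength > maxWidth) {
    maxWidth = stringLength;
    maxRecords = 16 * 1024 * 1024 / maxWidth;
  }
  // reallocIfNeeded returns the (possibly new) buffer, so keep the result.
  buffer = buffer.reallocIfNeeded(stringLength);
  buffer.setBytes(0, bytes, 0, stringLength);
  map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
}
The loop condition recordCount < maxRecords then picks up the tightened limit
on its next iteration, with no 256-character truncation needed.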
Thanks,
- Paul
On Jan 26, 2017, at 5:31 PM, Charles Givre <[email protected]> wrote:
Hi Paul,
Would you mind taking a look at my code? I’m wondering if I’m doing this
correctly. For context, I’m working on a generic log file reader for Drill
(https://github.com/cgivre/drill-logfile-plugin), and I encountered errors
when working with fields that were > 256 characters long. It isn’t a storage
plugin, but it extends the EasyFormatPlugin.
I added some code to truncate the strings to 256 chars, and that worked.
Before that, it was throwing errors like the one below:
Error: DATA_READ ERROR: index: 0, length: 430 (expected: range(0, 256))
Fragment 0:0
[Error Id: b2250326-f983-440c-a73c-4ef4a6cf3898 on
charless-mbp-2.fios-router.home:31010] (state=,code=0)
The query that generated this was just a SELECT * FROM dfs.`file`. Also, how
do I set the size of each row batch?
Thank you for your help.
— C
if (m.find()) {
  for (int i = 1; i <= m.groupCount(); i++) {
    //TODO Add option for date fields
    String fieldName = fieldNames.get(i - 1);
    String fieldValue = m.group(i);
    if (fieldValue == null) {
      fieldValue = "";
    }
    byte[] bytes = fieldValue.getBytes("UTF-8");
    //Added this and it worked….
    int stringLength = bytes.length;
    if (stringLength > 256) {
      stringLength = 256;
    }
    this.buffer.setBytes(0, bytes, 0, stringLength);
    map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
  }
}
On Jan 26, 2017, at 20:20, Paul Rogers <[email protected]> wrote:
Hi Charles,
The Varchar column can hold any length of data. We’ve recently been working on
tests that have columns up to 8K in length.
The one caveat is that, when working with data larger than 256 bytes, you must
be extremely careful in your reader. The out-of-the-box text reader always
reads 64K rows per batch, which (due to various issues) can cause memory
fragmentation and OOM errors when columns are wider than 256 bytes.
If you are developing your own storage plugin, adjust the size of each row
batch so that no single vector grows larger than 16 MB; then you can use
columns of any size.
Suppose your logs contain text lines up to, say, 1 KB in size. Then each
record batch your reader produces should hold at most 16 MB / 1 KB per row =
16K rows (rather than the usual 64K).
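In code form, the calculation is just this (the 16 MB cap and the 1 KB width
come from the numbers above; the variable names are only illustrative):
int vectorLimit = 16 * 1024 * 1024;  // 16 MB cap per value vector
int maxLineWidth = 1024;             // assumed widest log line, in bytes
int rowsPerBatch = Math.min(64 * 1024, vectorLimit / maxLineWidth);  // 16,384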
Once the data is in the Varchar column, the rest of Drill should “just work” on
that data.
- Paul
On Jan 26, 2017, at 4:11 PM, Charles Givre <[email protected]> wrote:
I’m working on a plugin to read log files, and the data has some long strings.
Is there a data type that can hold strings longer than 256 characters?
Thanks,
— Charles