Hi Charles,

Very cool plugin!

My knowledge in this area is a bit sketchy… That said, the problem appears to 
be that the code never grows the DrillBuf to ensure it has sufficient 
capacity. Try calling reallocIfNeeded before each write (note that it returns 
the buffer, which may be a new, larger one), something like this:

       this.buffer = this.buffer.reallocIfNeeded(stringLength);
       this.buffer.setBytes(0, bytes, 0, stringLength);
       map.varChar(fieldName).writeVarChar(0, stringLength, this.buffer);

Then, comment out the 256 length hack and see if it works.
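For reference, here is the whole write path pulled into a helper, as a minimal 
sketch only (writeVarCharField is a made-up name; map and buffer are the writer 
and DrillBuf fields from your snippet):

       // Sketch: grow the DrillBuf before every write. reallocIfNeeded() may
       // return a different, larger buffer, so always keep the result.
       // Needs: import java.nio.charset.StandardCharsets;
       private void writeVarCharField(String fieldName, String fieldValue) {
           byte[] bytes = fieldValue.getBytes(StandardCharsets.UTF_8);
           int stringLength = bytes.length;
           this.buffer = this.buffer.reallocIfNeeded(stringLength);
           this.buffer.setBytes(0, bytes, 0, stringLength);
           map.varChar(fieldName).writeVarChar(0, stringLength, this.buffer);
       }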

To avoid memory fragmentation, maybe change your loop like this:

            int maxRecords = MAX_RECORDS_PER_BATCH;
            int maxWidth = 256;
            while (recordCount < maxRecords && (line = this.reader.readLine()) != null) {
               …
               if (stringLength > maxWidth) {
                  maxWidth = stringLength;
                  maxRecords = 16 * 1024 * 1024 / maxWidth;
               }
            }

The above is not perfect (the last record added might be much larger than the 
others, causing the corresponding vector to grow larger than 16 MB), but the 
occasional large vector should be OK.
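If it helps, here is a slightly fuller sketch of that loop with the gaps filled 
in (MAX_RECORDS_PER_BATCH and this.reader come from your plugin; 
writeVarCharField is the helper sketched above, and none of this is tested):

            int recordCount = 0;
            int maxRecords = MAX_RECORDS_PER_BATCH;
            int maxWidth = 256;
            String line;
            while (recordCount < maxRecords && (line = this.reader.readLine()) != null) {
                // Track the widest row seen so far and shrink the batch limit
                // so the resulting vector stays at or under roughly 16 MB.
                int stringLength = line.getBytes(StandardCharsets.UTF_8).length;
                if (stringLength > maxWidth) {
                    maxWidth = stringLength;
                    maxRecords = 16 * 1024 * 1024 / maxWidth;
                }
                // … match the regex and write each field via writeVarCharField() …
                recordCount++;
            }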

Thanks,

- Paul

On Jan 26, 2017, at 5:31 PM, Charles Givre <[email protected]> wrote:

Hi Paul,
Would you mind taking a look at my code?  I’m wondering if I’m doing this 
correctly.  Just for context, I’m working on a generic log file reader for 
Drill (https://github.com/cgivre/drill-logfile-plugin), and I encountered some 
errors when working with fields that were > 256 characters long.  It isn’t a 
storage plugin, but it extends the EasyFormatPlugin.

I added some code to truncate the strings to 256 chars, and it worked.  Before 
this, it was throwing errors as shown below:

Error: DATA_READ ERROR: index: 0, length: 430 (expected: range(0, 256))

Fragment 0:0

[Error Id: b2250326-f983-440c-a73c-4ef4a6cf3898 on 
charless-mbp-2.fios-router.home:31010] (state=,code=0)


The query that generated this was just a SELECT * FROM dfs.`file`.  Also, how 
do I set the size of each row batch?
Thank you for your help.
— C


if (m.find()) {
    for (int i = 1; i <= m.groupCount(); i++) {
        // TODO Add option for date fields
        String fieldName = fieldNames.get(i - 1);
        String fieldValue = m.group(i);
        if (fieldValue == null) {
            fieldValue = "";
        }
        byte[] bytes = fieldValue.getBytes("UTF-8");

        // Added this and it worked….
        int stringLength = bytes.length;
        if (stringLength > 256) {
            stringLength = 256;
        }

        this.buffer.setBytes(0, bytes, 0, stringLength);
        map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
    }
}




On Jan 26, 2017, at 20:20, Paul Rogers <[email protected]> wrote:

Hi Charles,

The Varchar column can hold any length of data. We’ve recently been working on 
tests that have columns up to 8K in length.

The one caveat is that, when working with data larger than 256 bytes, you must 
be extremely careful in your reader. The out-of-box text reader will always 
read 64K rows. This (due to various issues) can cause memory fragmentation and 
OOM errors when used with columns greater than 256 bytes in width.

If you are developing your own storage plugin, then adjust the size of each row 
batch so that no single vector is larger than 16 MB in size. Then you can use 
any size of column.

Suppose your logs contain text lines up to, say, 1 KB in size. This means that 
each record batch your reader produces must hold no more than 16 MB / 1 KB per 
row = 16K rows (rather than the usual 64K).
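
In code, the calculation might look something like this (a sketch only; 
maxRowsPerBatch is a made-up name, and 64K is the usual per-batch row ceiling):

       // Sketch: derive a per-batch row limit from an estimated field width.
       private static final int MAX_VECTOR_BYTES = 16 * 1024 * 1024;
       private static final int MAX_BATCH_ROWS = 64 * 1024;

       static int maxRowsPerBatch(int estimatedWidthBytes) {
           return Math.min(MAX_BATCH_ROWS,
                           MAX_VECTOR_BYTES / Math.max(1, estimatedWidthBytes));
       }

       // maxRowsPerBatch(1024) == 16384, the 16K figure above.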

Once the data is in the Varchar column, the rest of Drill should “just work” on 
that data.

- Paul

On Jan 26, 2017, at 4:11 PM, Charles Givre <[email protected]> wrote:

I’m working on a plugin to read log files and the data has some long strings.  
Is there a data type that can hold strings longer than 256 characters?
Thanks,
— Charles


