Thanks!  I’m hoping to submit a PR eventually once I have this all done.  I 
tried your changes and now I’m getting this error:

0: jdbc:drill:zk=local> select * from dfs.client.`small.misolog`;
Error: DATA_READ ERROR: Tried to remove unmanaged buffer.

Fragment 0:0

[Error Id: 52fc846a-1d94-4300-bcb4-7000d0949b3c on 
charless-mbp-2.fios-router.home:31010] (state=,code=0)




> On Jan 26, 2017, at 23:08, Paul Rogers <[email protected]> wrote:
> 
> Hi Charles,
> 
> Very cool plugin!
> 
> My knowledge in this area is a bit sketchy… That said, the problem appears to 
> be that the code does not extend the DrillBuf to ensure it has sufficient 
> capacity. Try calling reallocIfNeeded, something like this:
> 
>       this.buffer.reallocIfNeeded(stringLength);
>       this.buffer.setBytes(0, bytes, 0, stringLength);
>       map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
> 
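> One caveat, since my DrillBuf knowledge is hazy (treat this as an untested 
> sketch): reallocIfNeeded() returns the buffer, and it may hand back a new 
> instance, so it is safer to keep the returned reference:
> 
>       // untested: use the buffer that reallocIfNeeded() returns
>       this.buffer = this.buffer.reallocIfNeeded(stringLength);
>       this.buffer.setBytes(0, bytes, 0, stringLength);
>       map.varChar(fieldName).writeVarChar(0, stringLength, this.buffer);
> 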
> Then, comment out the 256 length hack and see if it works.
> 
> To avoid memory fragmentation, maybe change your loop to something like this:
> 
>            int maxRecords = MAX_RECORDS_PER_BATCH;
>            int maxWidth = 256;
>            while (recordCount < maxRecords
>                   && (line = this.reader.readLine()) != null) {
>            …
>               if (stringLength > maxWidth) {
>                  maxWidth = stringLength;
>                  maxRecords = 16 * 1024 * 1024 / maxWidth;
>               }
> 
> The above is not perfect (the last record added might be much larger than the 
> others, causing the corresponding vector to grow larger than 16 MB), but the 
> occasional large vector should be OK.
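> 
> For concreteness, here is a rough, untested sketch of how the whole read loop 
> might look with both changes folded in (reader, buffer, map and 
> MAX_RECORDS_PER_BATCH are just stand-ins for whatever your reader already 
> has; adjust to your actual fields):
> 
>     private static final int MAX_VECTOR_BYTES = 16 * 1024 * 1024;  // ~16 MB per vector
> 
>     public int next() {
>         int recordCount = 0;
>         int maxRecords = MAX_RECORDS_PER_BATCH;
>         int maxWidth = 256;
>         try {
>             String line;
>             while (recordCount < maxRecords
>                    && (line = this.reader.readLine()) != null) {
> 
>                 // Use the raw line length as an upper bound on any field width
>                 // and shrink the batch cap so no value vector tops ~16 MB.
>                 int stringLength = line.getBytes("UTF-8").length;
>                 if (stringLength > maxWidth) {
>                     maxWidth = stringLength;
>                     maxRecords = MAX_VECTOR_BYTES / maxWidth;
>                 }
> 
>                 // ... match the regex and, for each captured field, roughly:
>                 //     buffer = buffer.reallocIfNeeded(len);
>                 //     buffer.setBytes(0, fieldBytes, 0, len);
>                 //     map.varChar(fieldName).writeVarChar(0, len, buffer);
>                 recordCount++;
>             }
>         } catch (IOException e) {
>             throw new RuntimeException("Error reading log file", e);
>         }
>         return recordCount;
>     }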
> 
> Thanks,
> 
> - Paul
> 
> On Jan 26, 2017, at 5:31 PM, Charles Givre <[email protected]> wrote:
> 
> Hi Paul,
> Would you mind taking a look at my code?  I’m wondering if I’m doing this 
> correctly.  Just for context, I’m working on a generic log file reader for 
> Drill (https://github.com/cgivre/drill-logfile-plugin), and I encountered some 
> errors when working with fields that were > 256 characters long.  It isn’t a 
> storage plugin, but it extends the EasyFormatPlugin.
> 
> I added some code to truncate the strings to 256 chars, and that worked.  
> Before that, it was throwing the error shown below:
> 
> 
> 
> Error: DATA_READ ERROR: index: 0, length: 430 (expected: range(0, 256))
> 
> Fragment 0:0
> 
> [Error Id: b2250326-f983-440c-a73c-4ef4a6cf3898 on 
> charless-mbp-2.fios-router.home:31010] (state=,code=0)
> 
> 
> The query that generated this was just a SELECT * FROM dfs.`file`.  Also, how 
> do I set the size of each row batch?
> Thank you for your help.
> — C
> 
> 
> if (m.find()) {
>   for (int i = 1; i <= m.groupCount(); i++) {
>       // TODO Add option for date fields
>       String fieldName = fieldNames.get(i - 1);
>       String fieldValue = m.group(i);
> 
>       if (fieldValue == null) {
>           fieldValue = "";
>       }
>       byte[] bytes = fieldValue.getBytes("UTF-8");
> 
>       // Added this and it worked…
>       int stringLength = bytes.length;
>       if (stringLength > 256) {
>           stringLength = 256;
>       }
> 
>       this.buffer.setBytes(0, bytes, 0, stringLength);
>       map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
>   }
> }
> 
> 
> 
> 
> On Jan 26, 2017, at 20:20, Paul Rogers <[email protected]> wrote:
> 
> Hi Charles,
> 
> The Varchar column can hold any length of data. We’ve recently been working 
> on tests that have columns up to 8K in length.
> 
> The one caveat is that, when working with data larger than 256 bytes, you 
> must be extremely careful in your reader. The out-of-the-box text reader will 
> always read 64K rows. This (due to various issues) can cause memory 
> fragmentation and OOM errors when used with columns greater than 256 bytes in 
> width.
> 
> If you are developing your own storage plugin, then adjust the size of each 
> row batch so that no single vector is larger than 16 MB in size. Then you can 
> use any size of column.
> 
> Suppose your logs contain text lines up to, say, 1 KB in size. This means that 
> each record batch your reader produces should hold no more than about 
> 16 MB / 1 KB per row ≈ 16K rows (rather than the usual 64K).
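> 
> In code form, the cap might be computed roughly like this (the 1 KB width is 
> just an assumed example; plug in whatever maximum you expect):
> 
>     int maxVectorBytes = 16 * 1024 * 1024;      // keep each value vector under ~16 MB
>     int estimatedFieldWidth = 1024;             // assume log lines up to ~1 KB
>     int maxRecordsPerBatch = Math.min(64 * 1024,
>         maxVectorBytes / estimatedFieldWidth);  // ≈ 16,384 rows instead of 64K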
> 
> Once the data is in the Varchar column, the rest of Drill should “just work” 
> on that data.
> 
> - Paul
> 
> On Jan 26, 2017, at 4:11 PM, Charles Givre <[email protected]> wrote:
> 
> I’m working on a plugin to read log files and the data has some long strings. 
>  Is there a data type that can hold strings longer than 256 characters?
> Thanks,
> — Charles
> 
> 
> 
