Hi Paul, 
Would you mind taking a look at my code?  I’m wondering whether I’m doing this 
correctly.  For context, I’m working on a generic log file reader for 
Drill (https://github.com/cgivre/drill-logfile-plugin), and I encountered 
errors when working with fields longer than 256 characters.  It isn’t a 
storage plugin; it extends EasyFormatPlugin. 

I added some code to truncate the strings to 256 bytes, and that fixed it.  
Before that, it was throwing errors like this:



Error: DATA_READ ERROR: index: 0, length: 430 (expected: range(0, 256))

Fragment 0:0

[Error Id: b2250326-f983-440c-a73c-4ef4a6cf3898 on 
charless-mbp-2.fios-router.home:31010] (state=,code=0)


The query that generated this was just a SELECT * FROM dfs.`file`.  Also, how 
do I set the size of each row batch?
Thank you for your help.
— C


if (m.find()) {
    for (int i = 1; i <= m.groupCount(); i++) {
        // TODO Add option for date fields
        String fieldName = fieldNames.get(i - 1);
        String fieldValue = m.group(i);

        if (fieldValue == null) {
            fieldValue = "";
        }
        byte[] bytes = fieldValue.getBytes(StandardCharsets.UTF_8);

        // Added this and it worked: truncate to the 256-byte buffer size
        int stringLength = Math.min(bytes.length, 256);

        this.buffer.setBytes(0, bytes, 0, stringLength);
        map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
    }
}
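For what it’s worth, here is a minimal, self-contained sketch of the batch-sizing arithmetic Paul describes below (the 16 MB vector cap is from his note; the class and method names here are illustrative assumptions, not Drill API):

```java
// Sketch (not Drill API): cap the rows per batch so that no single
// VarChar vector exceeds a 16 MB limit, per the advice quoted below.
public class BatchSizer {
    static final int MAX_VECTOR_BYTES = 16 * 1024 * 1024; // 16 MB per vector
    static final int MAX_ROWS = 65536;                    // Drill's usual 64K rows

    // Rows per batch given an estimated maximum field width in bytes.
    static int rowsPerBatch(int maxFieldWidthBytes) {
        int rows = MAX_VECTOR_BYTES / maxFieldWidthBytes;
        return Math.min(rows, MAX_ROWS);
    }

    public static void main(String[] args) {
        // For 1 KB lines, cap the batch well below the usual 64K rows.
        System.out.println(rowsPerBatch(1024)); // 16384
    }
}
```

In the reader’s next() loop you would stop after rowsPerBatch(estimatedWidth) records and return, letting Drill start a new batch.  Also, rather than truncating, I believe Drill readers can grow the working buffer before writing (something like buffer = buffer.reallocIfNeeded(bytes.length), if that method exists in your Drill version), which would preserve the full field value.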




> On Jan 26, 2017, at 20:20, Paul Rogers <[email protected]> wrote:
> 
> Hi Charles,
> 
> The Varchar column can hold any length of data. We’ve recently been working 
> on tests that have columns up to 8K in length.
> 
> The one caveat is that, when working with data larger than 256 bytes, you 
> must be extremely careful in your reader. The out-of-box text reader will 
> always read 64K rows. This (due to various issues) can cause memory 
> fragmentation and OOM errors when used with columns greater than 256 bytes in 
> width.
> 
> If you are developing your own storage plugin, then adjust the size of each 
> row batch so that no single vector is larger than 16 MB in size. Then you can 
> use any size of column.
> 
> Suppose your logs contain text lines up to, say, 1 KB in size. This means that 
> each record batch your reader produces must be capped at 16 MB / 1 KB per 
> row = 16,384 rows (rather than the usual 64K.)
> 
> Once the data is in the Varchar column, the rest of Drill should “just work” 
> on that data.
> 
> - Paul
> 
>> On Jan 26, 2017, at 4:11 PM, Charles Givre <[email protected]> wrote:
>> 
>> I’m working on a plugin to read log files and the data has some long 
>> strings.  Is there a data type that can hold strings longer than 256 
>> characters?
>> Thanks,
>> — Charles
> 
