Hi Paul,
Would you mind taking a look at my code? I’m wondering if I’m doing this
correctly. Just for context, I'm working on a generic log file reader for
Drill (https://github.com/cgivre/drill-logfile-plugin), and I encountered some
errors when working with fields that were more than 256 characters long. It
isn't a storage plugin; it extends EasyFormatPlugin.
I added some code to truncate the strings to 256 bytes, and that worked.
Before that, it was throwing the error shown below:
Error: DATA_READ ERROR: index: 0, length: 430 (expected: range(0, 256))
Fragment 0:0
[Error Id: b2250326-f983-440c-a73c-4ef4a6cf3898 on
charless-mbp-2.fios-router.home:31010] (state=,code=0)
The query that generated this was just a SELECT * FROM dfs.`file`. Also, how
do I set the size of each row batch?
Thank you for your help.
— C
if (m.find()) {
    for (int i = 1; i <= m.groupCount(); i++) {
        // TODO: Add option for date fields
        String fieldName = fieldNames.get(i - 1);
        String fieldValue = m.group(i);
        if (fieldValue == null) {
            fieldValue = "";
        }
        byte[] bytes = fieldValue.getBytes(StandardCharsets.UTF_8);
        // Added this and it worked...
        int stringLength = bytes.length;
        if (stringLength > 256) {
            stringLength = 256;
        }
        this.buffer.setBytes(0, bytes, 0, stringLength);
        map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
    }
}
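One more idea I had: instead of truncating, grow the scratch buffer so the
whole value fits. Assuming buffer is a DrillBuf (my understanding is that
reallocIfNeeded() returns a buffer of at least the requested capacity,
possibly a new instance), a sketch would be:

byte[] bytes = fieldValue.getBytes(StandardCharsets.UTF_8);
// Grow (or replace) the buffer so the entire value fits, then write it all.
buffer = buffer.reallocIfNeeded(bytes.length);
buffer.setBytes(0, bytes, 0, bytes.length);
map.varChar(fieldName).writeVarChar(0, bytes.length, buffer);

That would avoid silently dropping anything past 256 bytes, if it's safe to do.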
> On Jan 26, 2017, at 20:20, Paul Rogers <[email protected]> wrote:
>
> Hi Charles,
>
> The Varchar column can hold any length of data. We’ve recently been working
> on tests that have columns up to 8K in length.
>
> The one caveat is that, when working with data larger than 256 bytes, you
> must be extremely careful in your reader. The out-of-box text reader will
> always read 64K rows. This (due to various issues) can cause memory
> fragmentation and OOM errors when used with columns greater than 256 bytes in
> width.
>
> If you are developing your own storage plugin, then adjust the size of each
> row batch so that no single vector is larger than 16 MB in size. Then you can
> use any size of column.
>
> Suppose your logs contain text lines up to, say, 1 KB in size. This means
> each record batch your reader produces must be capped at 16 MB / 1 KB per
> row = 16,384 rows (rather than the usual 64K.)
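>
> Concretely, as a sketch (the constant and method names below are just
> illustration, not Drill APIs):
>
> // Cap rows per batch so the widest vector stays under 16 MB.
> private static final int MAX_VECTOR_BYTES = 16 * 1024 * 1024;
> private static final int MAX_BATCH_ROWS = 64 * 1024; // Drill's usual limit
>
> int rowsPerBatch(int maxFieldWidthBytes) {
>     return Math.min(MAX_BATCH_ROWS, MAX_VECTOR_BYTES / maxFieldWidthBytes);
> }
> // e.g., 1 KB fields: 16,777,216 / 1,024 = 16,384 rows per batch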
>
> Once the data is in the Varchar column, the rest of Drill should “just work”
> on that data.
>
> - Paul
>
>> On Jan 26, 2017, at 4:11 PM, Charles Givre <[email protected]> wrote:
>>
>> I’m working on a plugin to read log files and the data has some long
>> strings. Is there a data type that can hold strings longer than 256
>> characters?
>> Thanks,
>> — Charles
>