Hi Owen,
Thanks for the quick response.
Essentially, I have a real-time Avro -> ORC conversion process. I do the
conversion myself using the Java API. If I hit a serialization failure or
similar inside my own code, I push the record to a queue to handle
offline.
However, since I write the data for a single record column vector by column
vector, I want to make sure I don't have partial data from the failed
record still in the vector positions for that failed record.
Here is a small snippet to illustrate what I'm doing. *addToVector* could
fail for any number of reasons, so I track the failed Avro record in a
separate thread, but I want to make sure that the other column vectors are
reset at that vectorPosition, maybe to their defaults. Maybe it's a dumb
question, but I can't figure out a smart way to do that, or whether I'm
thinking about the rollback idea correctly. Hopefully that is clear. Thanks Owen!
for (int c = 0; c < batch.numCols; c++) {
    ColumnVector colVector = batch.cols[c];
    final String thisField = orcSchema.getFieldNames().get(c);
    int vectorPosition = batch.size;
    Logger.orcConversionStatus(LOGGER_TRACE_ID, CLASS_LOCATION,
        String.format("Processing field: %s", thisField));
    final TypeDescription type = orcSchema.getChildren().get(c);
    Object fieldValue = record.get(thisField);
    Schema.Field avroField = currSchema.getField(thisField);
    // If this fails on some column X, I want to rollback the data I've
    // written for batch.numCols - X
    addToVector(type, colVector, avroField.schema(), fieldValue, vectorPosition);
}
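To make the rollback idea concrete, here is a rough, self-contained sketch of what I have in mind. It is not ORC code: SketchColumn and SketchBatch are made-up stand-ins that only mirror the vector, isNull, noNulls, cols and size fields of ORC's column vectors and VectorizedRowBatch, and this addToVector is a simplified placeholder for my real one. The assumption is that batch.size is only incremented after every column has been written, so a failed row never becomes visible and its slot is just reset and reused.

```java
// Hypothetical stand-in for an ORC LongColumnVector (only the fields used here).
class SketchColumn {
    final long[] vector;
    final boolean[] isNull;
    boolean noNulls = true;

    SketchColumn(int capacity) {
        vector = new long[capacity];
        isNull = new boolean[capacity];
    }
}

// Hypothetical stand-in for VectorizedRowBatch (only cols and size).
class SketchBatch {
    final SketchColumn[] cols;
    int size = 0;

    SketchBatch(int numCols, int capacity) {
        cols = new SketchColumn[numCols];
        for (int c = 0; c < numCols; c++) {
            cols[c] = new SketchColumn(capacity);
        }
    }
}

class RollbackSketch {
    // Placeholder for the real addToVector: writes one field, may throw.
    static void addToVector(SketchColumn col, String raw, int pos) {
        if (raw == null) {
            col.isNull[pos] = true;
            col.noNulls = false;
            return;
        }
        col.vector[pos] = Long.parseLong(raw); // NumberFormatException on bad input
        col.isNull[pos] = false;
    }

    // Returns true if the row was committed, false if it was rolled back.
    static boolean tryAddRow(SketchBatch batch, String[] fields) {
        int pos = batch.size;
        try {
            for (int c = 0; c < batch.cols.length; c++) {
                addToVector(batch.cols[c], fields[c], pos);
            }
        } catch (RuntimeException e) {
            // Rollback: put this slot back to defaults in every column.
            // noNulls is left alone; it may stay false, which is conservative.
            for (SketchColumn col : batch.cols) {
                col.vector[pos] = 0L;
                col.isNull[pos] = false;
            }
            return false; // caller would push the record to the dead-letter queue
        }
        batch.size++; // the row only becomes visible once all columns succeeded
        return true;
    }

    public static void main(String[] args) {
        SketchBatch batch = new SketchBatch(3, 4);
        tryAddRow(batch, new String[] {"1", "2", "3"});
        tryAddRow(batch, new String[] {"4", null, "oops"}); // fails on the third column
        tryAddRow(batch, new String[] {"7", "8", "9"});
        System.out.println("rows committed: " + batch.size);
    }
}
```

With this shape the failed row needs no special marker in the batch: because batch.size never advanced, the next good record is written over the same position.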
On Fri, Sep 11, 2020 at 10:37 AM Owen O'Malley <[email protected]>
wrote:
> Where is the failure happening? If it is happening in the ORC writer code,
> there isn't a way to do that. Can I ask what kind of exception you are
> hitting? In the column (aka tree) writers, there shouldn't be much that can
> go wrong. It doesn't even write to the file handle, just buffering in
> memory.
>
> If the problem is in your code, you should be able to use the selected
> vector in the VectorizedRowBatch to just select the other rows.
>
> .. Owen
>
> On Fri, Sep 11, 2020 at 7:12 AM Ryan Schachte <[email protected]>
> wrote:
>
> > I'm writing a streaming application that converts incoming data into ORC
> > in real-time. One thing I'm implementing is a dead-letter queue that still
> > allows me to continue the batch processing even if a single record fails.
> >
> > The caveat to this, is I want to remove the data that has been written
> > thus far if a failure occurs on say the 6th column out of 10 columns. For
> > example:
> >
> > I write the following data:
> >
> > {
> > firstName: blah1,
> > lastName: blah2,
> > otherData: blah3
> > }
> >
> > My question is, if I fail on otherData, I want to "rollback" the data from
> > the column vectors at the current vectorPosition I'm iterating on. Is it as
> > simple as setting colVector.isNull[vectorPosition] to true and setting
> > colVector.noNulls to false? I wanted to originally go into the index for
> > each column vector and override, but I don't see an easy way to do that.
> >
> > Cheers!!
> > Ryan Schachte
> >
>