Yeah, the default in ruby-marc for encodings that _aren't_ MARC8 is to ignore bad bytes entirely -- leave them in the MARC::Record as bad bytes. This is likely to end up raising an exception later when you try to DO something with those Strings, but it was left this way for backwards compatibility reasons.

You can optionally tell ruby-marc to raise or 'fix' these bad bytes instead, but the default is to leave them alone.

However, that's not really possible for MARC8->UTF8 conversion. Since a conversion is going on, bad bytes can't be 'left alone', something has to be done with them -- raise or replace.
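To make the two choices concrete, here is a minimal sketch in plain Ruby. It is not the ruby-marc API: Ruby core has no MARC-8 encoding, so it uses a US-ASCII to UTF-8 transcode as a stand-in, and `convert_strict` and `convert_lenient` are hypothetical names made up for illustration.

```ruby
# Stand-in for a MARC8 -> UTF8 conversion (assumption: Ruby core has no
# MARC-8 encoding, so we transcode US-ASCII -> UTF-8, where 0xE9 is illegal).
bad_input = "caf\xE9".dup.force_encoding("US-ASCII")

# Option 1: raise. String#encode raises on an illegal byte by default.
# (hypothetical helper name)
def convert_strict(str)
  str.encode("UTF-8")
end

# Option 2: replace. The illegal byte is swapped for a replacement string.
# (hypothetical helper name)
def convert_lenient(str, replacement = "?")
  str.encode("UTF-8", invalid: :replace, undef: :replace, replace: replacement)
end

# Capture the exception from the strict path so both results can be compared.
strict_error =
  begin
    convert_strict(bad_input)
    nil
  rescue EncodingError => e
    e
  end

lenient = convert_lenient(bad_input)

puts "strict:  #{strict_error.class}"
puts "lenient: #{lenient}"
```

Either way a decision has to be made at the point of conversion; there is no third "leave it alone" path once the bytes are being transcoded.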

My question here is solely about MARC8->UTF8 conversion; I am not changing anything else about the ruby-marc API at this time.

"I think raising an exception is fine, as long as we can still continue to walk the records with the reader." Honestly, I'm not sure that's true. I'm not sure how easy it's going to be to continue iterating through the records after an exception; I think the exception gets raised in a place that leaves the reader in an inconsistent state. If so, there may not be any easy way to fix that. Bah. Scott, you want to beta test this new version of ruby-marc?
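For what it's worth, here is a rough sketch of the "catch, log, continue" pattern Scott describes. This is not how ruby-marc's reader works; it only shows the shape that makes continuing possible, which is that each record's raw bytes get sliced off (here on the MARC record terminator, 0x1D) before any decoding is attempted. `each_raw_record` and `decode_record` are hypothetical helpers, and the US-ASCII to UTF-8 transcode again stands in for MARC8 conversion.

```ruby
RECORD_TERMINATOR = "\x1D".b # MARC record terminator byte

# Hypothetical helper: yields each record's raw bytes without decoding them,
# so a bad byte inside one record cannot disturb the position of the next.
def each_raw_record(stream)
  stream.b.split(RECORD_TERMINATOR).each { |raw| yield raw }
end

# Hypothetical decoder standing in for MARC8 -> UTF8 conversion.
# Raises an EncodingError subclass when it hits an illegal byte.
def decode_record(raw)
  raw.dup.force_encoding("US-ASCII").encode("UTF-8")
end

# Three toy "records"; the second contains an illegal byte (0xE9).
stream = "record one\x1Drecord tw\xE9\x1Drecord three\x1D"

good = []
errors = []
index = 0

each_raw_record(stream) do |raw|
  begin
    good << decode_record(raw)
  rescue EncodingError => e
    # Log and move on; the iteration itself is unaffected.
    errors << "record #{index}: #{e.message}"
  end
  index += 1
end

puts "converted #{good.size} records, skipped #{errors.size}"
errors.each { |line| warn line }
```

Whether the real reader can be arranged this way is exactly the open question above.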

At any rate, pull requests are always welcome once it gets released. Having some MARC8->UTF8 conversion seems like an improvement even if the details aren't right. We've always placed a premium on backwards compat in ruby-marc though, so I wanted to try to avoid making API/default choices we'd later regret but not want to change for backwards compat.


On 11/20/13 11:40 AM, Scott Prater wrote:
Not sure what the details of our issue were on Monday -- but we do have
records that are supposedly encoded in UTF-8, but nonetheless contain
invalid characters.

I think raising an exception is fine, as long as we can still continue
to walk the records with the reader.  The right thing for application
code to do then would be to catch the exception, log it, and continue to
the next record.  The more information in the exception, the better.

-- Scott

I am not sure how you ran into this problem on Monday with ruby-marc,
since ruby-marc doesn't currently handle Marc8 conversion to UTF-8 at
all -- how could you have run into a problem with Marc8 to UTF8
conversion?  But that is what I am adding.

But yeah, using a preprocessor is certainly one option, that will not be
taken away from people. Although hopefully adding Marc8->UTF8 conversion
to ruby-marc might remove the need for a preprocessor in many cases.

So again, we have a bit of a paradox, that I have in my own head too.
Scott suggests that "In either case, what we DON'T want is to halt the
processing altogether."  And yet, still, that the default behavior
should be raising an exception -- that is, halting processing
altogether, right?

So hardly anyone hardly ever is going to want the default behavior, but
everyone thinks it should be default anyway, to force people to realize
what they're doing? I am not entirely objecting to that -- it's why I
brought it up here, but it does seem odd, doesn't it?  To say something
should be default that hardly anyone hardly ever will want?


On 11/20/13 10:10 AM, Scott Prater wrote:
We run into this problem fairly regularly, and in fact, ran into it on
Monday with ruby-marc.

The way we've traditionally handled it is to put our marc stream through
a cleanup preprocessor before passing it off to a marc parser (ruby marc
or marc4j).

The preprocessor can do one of two things:

   1)  Skip the bad record in the marc stream and move on; or
   2)  Substitute the bad characters with some default character, and
write it out.

In both cases we log the error as a warning, and include a byte offset
where the bad character occurs, and the record ID, if possible.  This
allows us to go back and fix the errors in a stream in a batch;
generally, the bad encoding errors fall into four or five common errors
(cutting and pasting data from Windows is a typical cause).

In either case, what we DON'T want is to halt the processing altogether.
  Generally, we're dealing with thousands, sometimes millions, of MARC
records in a stream;  it's very frustrating to get halfway through the
stream, then have the parser throw an exception and halt.  Halting the
processing should be the strategy of last resort, to be called only when
the stream has become so corrupted you can't go on to the next record.

I'd want the default to be option 1.  Let the user determine what
changes need to be made to the data;  the parser's job is to parse, not
infer and create.  Overwriting data could also lead to the misperception
that everything is okay, when it really isn't.

-- Scott

On 11/20/2013 08:32 AM, Jon Stroop wrote:
Coming from nowhere on this...is there a place where it would be
convenient to flag which behavior the user (of the library) wants? I
think you're correct that most of the time you'd just want to blow
through it (or replace it), but for the situation where this isn't the
case, I think the Right Thing to do is raise the exception. I don't
think you would want to bury it in some assumption made internal to the
library unless that assumption can be turned off.

-Jon


On 11/19/2013 07:51 PM, Jonathan Rochkind wrote:
ruby-marc users, a question.

I am working on some Marc8 to UTF-8 conversion for ruby-marc.

Sometimes, what appears to be an illegal byte will appear in the Marc8
input, and it can not be converted to UTF8.

The software will support two alternatives when this happens: 1)
Raising an exception. 2) Replacing the illegal byte with a replacement
char and/or omitting it.

I feel like most of the time, users are going to want #2.  I know
that's what I'm going to want nearly all the time.

Yet, still, I am feeling uncertain whether that should be the default.
Which should be the default behavior, #1 or #2?  If most people most
of the time are going to want #2 (is this true?), then should that be
the default behavior?   Or should #1 still be the default behavior,
because by default bad input should raise, not be silently recovered
from, even though most people most of the time won't want that, heh.

Jonathan



