On 6/21/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
On Jun 20, 2006, at 11:29 PM, David Balmain wrote:
> This seems like a lot to bundle to me when, like you said, it will
> probably be available on all platforms that Lucy might target. I don't
> see the problem with calling back to the native API. We are going to
> have to provide callbacks for things like memory allocation and
> exception handling, so I don't think an extra inflate and deflate
> callback is going to hurt. But if you feel strongly about this I'm not
> too fussed.
I could be persuaded. Callbacks to Perl from C are verbose and kind
of hard to get your head around, so I have something of an
instinctive aversion to them.
Here's a wrapper for compress() in Perl:
use Compress::Zlib qw( compress );

sub compress_it {
    my $input = shift;
    return compress($input);
}
... and here's the equivalent function rendered in XS, calling back
to Perl...
SV*
compress_it(SV *input)
{
    SV *retval;
    dSP;                                    /* declare stack pointer */

    ENTER;                                  /* opening bracket for a callback */
    SAVETMPS;                               /* opening bracket for temporaries */

    PUSHMARK(SP);                           /* start arg stack */
    XPUSHs( sv_2mortal( newSVsv(input) ) ); /* pass copy of input */
    PUTBACK;                                /* close arg stack */

    call_pv("Compress::Zlib::compress", G_SCALAR); /* invoke compress() */

    SPAGAIN;                                /* refresh stack pointer after the call */
    retval = newSVsv( POPs );               /* copy result before FREETMPS can free it */
    PUTBACK;

    FREETMPS;                               /* closing bracket for temporaries */
    LEAVE;                                  /* closing bracket for a callback */

    return sv_2mortal(retval);              /* hand the caller a mortal copy */
}
Untested. ;)
Ok, so it's a little easier for me. But it's only two methods,
compress and decompress. I don't think it will be too hard to do.
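
Just to make the shape of it concrete, here's roughly what I'm
picturing for the C side (a sketch only, every name invented, nothing
settled):

/* Rough sketch, hypothetical names: the host binding fills in a table
 * of function pointers at load time, and the C core calls through it
 * rather than linking against zlib (or the host) directly. */

#include <stddef.h>

typedef struct lucy_HostCallbacks {
    void *(*alloc)(size_t size);               /* memory allocation  */
    void  (*dealloc)(void *ptr);
    void  (*raise_error)(const char *message); /* exception handling */
    /* deflate/inflate: compress (or expand) in_len bytes from in into
     * out, which can hold *out_len bytes; write the number of bytes
     * produced back into *out_len and return 0 on success. */
    int   (*deflate)(const void *in, size_t in_len,
                     void *out, size_t *out_len);
    int   (*inflate)(const void *in, size_t in_len,
                     void *out, size_t *out_len);
} lucy_HostCallbacks;

static lucy_HostCallbacks lucy_callbacks;

/* Each binding (XS, a Ruby C extension, etc.) would call this once at
 * load time with pointers to its own wrappers. */
void
lucy_set_host_callbacks(const lucy_HostCallbacks *callbacks)
{
    lucy_callbacks = *callbacks;
}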
<snip>
> Do you plan on
> doing any other analysis at the C level or do you just want to make
> the Snowball parser available in the target API?
I think we'll want to render TokenBatch in C and make the Snowball
Stemmer able to act on the TokenBatch's member strings directly. If
I have to call back to Perl, I'll have to wrap token text in a Perl
scalar then recover it back into the TokenBatch, and it won't be as
efficient -- more copy ops.
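
To give an idea of what I mean, here's a back-of-the-envelope sketch;
the names and layout are made up, nothing is settled:

/* Back-of-the-envelope sketch, invented names: a TokenBatch that owns
 * its token texts as raw C buffers, so a C-level Analyzer (e.g. a
 * Snowball stemmer) can rewrite them in place, with no detour through
 * a host-language scalar and back. */

#include <stddef.h>

typedef struct Token {
    char   *text;          /* token text, not necessarily NUL-terminated */
    size_t  len;           /* byte length of text                        */
    size_t  start_offset;  /* position within the source field           */
    size_t  end_offset;
} Token;

typedef struct TokenBatch {
    Token  *tokens;
    size_t  size;
    size_t  cap;
} TokenBatch;

/* Run an in-place transform (such as a stemmer) over every member
 * string.  A transform that shortens a token writes the new length
 * back through *len. */
void
TokenBatch_transform(TokenBatch *batch,
                     void (*transform)(char *text, size_t *len))
{
    size_t i;
    for (i = 0; i < batch->size; i++) {
        transform(batch->tokens[i].text, &batch->tokens[i].len);
    }
}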
I'm not sure how the other Analyzers will be implemented.
The reason I asked is that if we're going to implement Analyzers at
the C level, then we'll have to worry about character encoding.
Unfortunately, this is a reality for me in Ferret since there is no
way to lowercase UTF-8 strings in Ruby yet. (Hopefully UTF-8 support
will be coming soon.) I think for Lucy we should probably leave the
analysis to the target language, at least to start with.
> What is the byte buffer for in particular?
KinoSearch does a lot of serialization and deserialization. It's
really handy to have a string that knows its own length when you're
doing stuff like concat and truncate ops all the time.
The external sorter takes ByteBuffers as its args. Without that, it
would have to take a char* and a string length at the same time, and
it would get really messy. Think qsort with strings that may contain
null bytes.
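
For a feel of why, here's the kind of comparison a length-aware buffer
makes painless (sketch only, not the actual KinoSearch code):

/* Sketch only, not the real KinoSearch code: a pointer plus a length,
 * so the contents may include NUL bytes and a sort routine can still
 * compare whole strings with memcmp semantics. */

#include <stddef.h>
#include <string.h>

typedef struct ByteBuf {
    char   *ptr;
    size_t  len;
} ByteBuf;

/* qsort-style comparison.  strcmp would stop at the first NUL byte;
 * memcmp over the known lengths doesn't care. */
int
ByteBuf_compare(const void *va, const void *vb)
{
    const ByteBuf *a = (const ByteBuf*)va;
    const ByteBuf *b = (const ByteBuf*)vb;
    size_t min_len    = a->len < b->len ? a->len : b->len;
    int    comparison = memcmp(a->ptr, b->ptr, min_len);
    if (comparison != 0) { return comparison; }
    /* When one buffer is a prefix of the other, the shorter sorts first. */
    return a->len < b->len ? -1 : (a->len > b->len ? 1 : 0);
}

With that, qsort(bufs, num_bufs, sizeof(ByteBuf), ByteBuf_compare)
just works, null bytes and all.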
Ok, that makes sense.