On Thu, Apr 26, 2012 at 8:45 AM, Nick Wellnhofer <[email protected]> wrote:
> I just published a compiled extension for Lucy on Github:
>
> https://github.com/nwellnhof/LucyX-Analysis-WhitespaceTokenizer
Awesome.
After a couple tweaks, I got it to build and pass its test suite. :)
> It's a simple whitespace tokenizer that's not meant to be used in production
> but to serve as a sample extension for development Here are some notes on
> stuff that's still to do:
>
> Currently, we use the last component of the module name as parcel. This
> results in very long symbol names in the case of WhitespaceTokenizer.
> We should add a "parcel" build parameter to Clownfish::CFC::Perl::Build, so
> we can use something shorter like "WSToker".
The Clownfish header language currently supports such a feature in its parcel
declarations:
parcel KinoSearch cnick Kino;
We had exactly the same issue that you're dealing with when the library had
the 10-letter name "KinoSearch" instead of the 4-letter name "Lucy". :)
See the patch below my signature for the impact of making such a change.
I'm not happy with that API for specifying parcel nicknames, though; it was
just enough to get the job done at the time, and parcels really need to
support a wide variety of properties.
My preferred mechanism now would be to have dedicated Clownfish parcel files
with ".cfp" extensions, encoded as JSON. Something like this, in the file
"core/WhitespaceTokenizer.cfp":
{
"parcel": "com::github::nwellnhof::WhitespaceTokenizer",
"version": "v0.1.0"
"nickname": "WSToker",
"requires": {
"org::apache::lucy::Lucy": "v0.4.0"
}
}
> In WhitespaceTokenizer.cfh I had to add a __C__ block that includes
> Lucy/Analysis/Inversion.h because the generated XS needs the LUCY_INVERSION
> VTable. That's not ideal.
Agreed. IMO what it should look like instead is either something like this...
parcel WhitespaceTokenizer;
use Lucy::Analysis::Tokenizer from org::apache::lucy::Lucy v0.4;
... or like this:
parcel WhitespaceTokenizer;
use org::apache::lucy::Lucy v0.4;
> As previously mentioned, all Lucy types used in WhitespaceTokenizer.cfh have
> to be prefixed with "lucy_".
Yes. This is high on my TODO list, but right this moment I have my teeth in a
potential solution for the fragile member var ABI issue. :)
> There's an intricate problem with XSLoader that only manifests when running
> the tests. See the comment in WhitespaceTokenizer.pm.
Sheesh, how goofy. I hope we find a fix for that eventually, so that
nobody else has to go down the lengthy debugging path you must have gone down.
Here are the two tweaks I had to make to build and pass tests on OS X:
* Hack in a Clownfish include directory manually for when Clownfish::CFC and
Lucy are not installed into $Config{installsitearch}.
* Comment out the "extra_linker_flags" argument in Build.PL.
On OS X, you get this error with you try to link against Lucy.bundle:
ld: in
/Users/marvin/Desktop/scratchlib/lib/perl5/darwin-2level/auto/Lucy/Lucy.bundle,
can't link with bundle (MH_BUNDLE) only dylibs (MH_DYLIB)
collect2: ld returned 1 exit status
Here's the explanation:
http://stackoverflow.com/questions/2339679/what-are-the-differences-between-so-and-dylib-on-osx
One Mach-O feature that hits many people by surprise is the distinction
between shared libraries and dynamically loadable modules. On ELF systems
both are the same; any piece of shared code can be used as a library and
for dynamic loading.
This leads me to question whether extensions should be linking against the
shared objects of their dependencies on *any* platform. Does that create a
tight binding that will e.g. break WhitespaceTokenizer when Lucy gets updated?
I'm pretty sure we want runtime relocation of upstream symbols.
> It's very illustrative to look at code that's created in autogen when
> building the extension, especially autogen/source/parcel.c.
Cool -- and parcel.c for WhitespaceTokenizer is only 133 lines long, as
opposed to over 33000 for all of Lucy!
Marvin Humphrey
--- a/core/LucyX/Analysis/WhitespaceTokenizer.c
+++ b/core/LucyX/Analysis/WhitespaceTokenizer.c
@@ -1,7 +1,7 @@
-#define WHITESPACETOKENIZER_USE_SHORT_NAMES
+#define WSTOKER_USE_SHORT_NAMES
#define LUCY_USE_SHORT_NAMES
-#define C_WHITESPACETOKENIZER_WHITESPACETOKENIZER
+#define C_WSTOKER_WHITESPACETOKENIZER
#define C_LUCY_TOKEN
#include "Lucy/Util/ToolSet.h"
@@ -13,7 +13,7 @@
#include <ctype.h>
void
-whitespacetokenizer_init_parcel() {
+wstoker_init_parcel() {
}
WhitespaceTokenizer*
diff --git a/core/LucyX/Analysis/WhitespaceTokenizer.cfh
b/core/LucyX/Analysis/WhitespaceTokenizer.cfh
index 78bc2c9..cab1fce 100644
--- a/core/LucyX/Analysis/WhitespaceTokenizer.cfh
+++ b/core/LucyX/Analysis/WhitespaceTokenizer.cfh
@@ -1,4 +1,4 @@
-parcel WhitespaceTokenizer;
+parcel WhitespaceTokenizer cnick WSToker;
__C__
#include "Lucy/Analysis/Inversion.h"
diff --git a/perl/buildlib/Binding.pm b/perl/buildlib/Binding.pm
index e9ad98b..820be33 100644
--- a/perl/buildlib/Binding.pm
+++ b/perl/buildlib/Binding.pm
@@ -12,7 +12,7 @@ sub bind_all {
MODULE = LucyX::Analysis::WhitespaceTokenizer PACKAGE =
LucyX::Analysis::WhitespaceTokenizer
BOOT:
- whitespacetokenizer_LucyX__Analysis__WhitespaceTokenizer_bootstrap();
+ wstoker_LucyX__Analysis__WhitespaceTokenizer_bootstrap();
END_XS_CODE
my $pod_spec = Clownfish::CFC::Binding::Perl::Pod->new;