[OpenEgypt:1897] Commercial vs. Open Source Language Packs: An Interview with Andrew Paulsen, Regional Director at Basis Technology | SearchHub | Lucene/Solr Open Source Search

Ahmed ElHefnawy Wed, 31 Jul 2013 17:10:16 -0700

http://searchhub.org/2013/07/12/commercial-vs-open-source-language-packs-an-interview-with-andrew-paulsen-regional-director-at-basis-technology/?goback=%2Egde_2474044_member_260003095

Commercial vs. Open Source Language Packs: An Interview with Andrew Paulsen,
Regional Director at Basis Technology

Today we’re talking with Andrew Paulsen of Basis Technology about their
commercial language packs.

Since Basis first came to market, open source has made huge strides forward in
supporting multiple languages. Not only does Solr support many European
languages, but it also has multiple options for Japanese and Chinese, including
morphological tokenization. But despite all this progress, Basis is still
around selling their wares.

Why would anybody pay for software when open source alternatives exist,
especially when using an open source search engine? What are the advantages to
the commercial packages? And how is it that Basis is still in business!? Andrew
was gracious enough to endure our interrogation!

Hi Andrew, thanks for chatting with us. First of all, can you tell us a little
bit about Basis, and maybe also a bit about your background?

Sure. Basis Technology has been around for over 18 years providing text
analytics software to some of the world’s most successful and innovative
software companies such as Adobe, EMC, Google, Microsoft, HP, Salesforce.com,
Oracle, Symantec and Yahoo!… to name just a few. I have worked out of our San
Francisco offices since 2000. Yes, you heard correctly, 12 years, which is a
long time in the software industry. Over the years I have helped numerous
software companies (including all the aforementioned) improve both their web
and enterprise search quality by implementing language-specific linguistic
support.

Let’s get right into it, and I assume you get this question a lot. Solr now
comes with a fair number of language-specific analyzers, so why are companies
still willing to purchase language packs from commercial vendors?

Yes, very good question. One would think that our business would be shrinking,
but it is actually growing considerably. I think that is because software
companies are getting more sophisticated and can derive more value out of high
quality linguistic analysis. Also, I think that companies are focusing more on
markets outside of the US for growth and are putting more effort and resources
into creating high quality support for search in foreign languages, as opposed
to just creating checkbox features.

To sum up our value proposition in relation to open source linguistics: we
provide higher quality, more in-depth features, a wider breadth of language
coverage and better performance/reliability. And as you know, software
engineers are expensive these days, especially search engineers with an NLP
background. Companies can actually save money and increase development
productivity by licensing a commercial-ready NLP platform as opposed to having
these well-paid engineers implement and test various linguistic modules from
around the world with various levels of quality and performance.

For Western languages, one of the things Basis enables is lemmatization vs.
simple stemming. This was something that FAST ESP used to talk about a lot,
before being acquired by Microsoft. Can you tell us more about this, and maybe
why it matters? Any examples?

Sure… As a side note FAST was one of our customers. When Microsoft acquired
FAST, many independent software vendors who licensed FAST moved over to
SOLR/Lucene. These former FAST customers have high expectations for non-English
search and so many of those customers licensed our software to integrate with
their SOLR implementations.

In regards to the benefits of lemmatization versus stemming: stemming relies on
context-insensitive algorithms that normalize tokens by chopping off the end of
the words based upon rules. The resulting stem is often not an actual word, but
an artifact of what used to be a word.

Example:
The word “babies” stems to “babie”
The word “copying” stems to “copi”

Lemmatization on the other hand is context-sensitive and normalizes a word to
its true dictionary form.

Example:
The word “babies” lemmatizes to “baby”
The word “copying” lemmatizes to “copy”

Stemming can create a lot of problems such as different words creating the same
stem or the same word creating different stems. There are also many instances
where stemming flat out fails and does nothing to try and normalize the word.
These types of problems and failures can create havoc for a company trying to
provide high quality search.
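
To make the contrast concrete, here is a minimal Python sketch. It is an
illustration only: the suffix rules and the tiny dictionary are hypothetical,
chosen to reproduce the examples above, and are not the rules of any real
stemmer or of Basis Technology's software. The part of speech is passed in by
hand here, whereas a real lemmatizer would work it out from the surrounding
context.

```python
# Toy illustration only: these rules are hypothetical and far cruder than
# any real stemmer or commercial lemmatizer.

SUFFIXES = ("ing", "s")  # checked in this order; the first match wins


def toy_stem(word):
    """Chop a known suffix off the end, then rewrite a final 'y' as 'i'."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if word.endswith("y"):
        word = word[:-1] + "i"
    return word


# Hypothetical mini-dictionary keyed by (surface form, part of speech).
LEMMAS = {
    ("babies", "NOUN"): "baby",
    ("copying", "VERB"): "copy",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}


def toy_lemmatize(word, pos):
    """Look up the dictionary form; fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())


if __name__ == "__main__":
    for w in ("babies", "baby", "copying", "copy", "news", "new"):
        print(w, "->", toy_stem(w))
    print(toy_lemmatize("babies", "NOUN"))
    print(toy_lemmatize("saw", "VERB"), toy_lemmatize("saw", "NOUN"))
```

With these toy rules, “babies” and “baby” come out as “babie” and “babi” (the
same word, two different stems), while “news” and “new” both come out as “new”
(two different words, one stem). The dictionary lookup returns “baby” for
“babies”, and returns “see” or “saw” for “saw” depending on the part of speech.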

Stemming was originally developed to support English search and although there
are some problems with English stemming, it generally works well for keyword
search applications. The real problems come into play with European languages.

European languages are highly inflected compared with English, meaning that,
depending on the context, the same word can be written many ways. Do you
remember trying to conjugate Spanish verbs in high school, or dealing with
masculine and feminine forms? The more inflected a language is, the more
important it is to provide morphological analysis that yields the correct
lemma for indexing.

Here is a whitepaper with extensive examples and explanations:
http://www.basistech.com/whitepapers/Enabling-High-Quality-Search-in-European-Languages-EN.pdf

When indexing text, the first step is to break it up into words. For Asian
languages this is tough because they often don’t have spaces between words. In
the early days of Solr, there was the primitive “ngram” based CJK module, which
just chopped text into character pairs and had no concept of linguistic
integrity. Obviously Basis was superior to that. But in recent years
morphological analyzers have been added for Chinese and Japanese, and it looks
like Korean is in the works. How will you compete with these?
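
As a rough sketch of the character-pair approach mentioned in the question, the
Python below splits text into overlapping two-character tokens. It assumes
overlapping pairs and is not the code of the old Solr CJK module; the function
name is made up for illustration.

```python
def char_bigrams(text):
    """Split text into overlapping two-character tokens, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # Nothing to pair up: return the single character, or nothing at all.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


if __name__ == "__main__":
    # 東京都 means "Tokyo Metropolis"; the bigram 京都 happens to spell "Kyoto".
    print(char_bigrams("東京都"))
```

For the three-character input 東京都 this yields the tokens 東京 and 京都, so a
search for Kyoto (京都) would match a document that only mentions Tokyo. That
is the kind of false hit a tokenizer with no notion of word boundaries can
produce.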

Agreed, the open source analyzers have certainly gotten better, but so has our
technology. For example, the SOLR Chinese analyzer only supports Simplified
Chinese and does not support Traditional Chinese. This is a deal killer for
many companies since supporting Traditional Chinese is often a requirement.

Also, we provide more in-depth features for companies that are serious about
providing high quality search, such as a user dictionary, a de-compounding
option, part of speech tagging, base noun phrase extraction, and, for Chinese,
pinyin readings. We also provide various knobs and dials to fine-tune the text
processing to meet a company’s unique search requirements. These sets of
features, which we refer to as Base Linguistics, are available across all the
languages that Basis Technology’s Rosette supports.

We’ve taken a look at the SOLR Japanese analyzer and it has a good set of
features; it was created by a former FAST employee who is well respected in
this space. Since the SOLR module for Korean isn’t available yet, I cannot
comment on that technology.

But the real takeaway is this: there are about 20-25 commercially significant
markets by language in the world. We currently provide linguistic analysis for
over 40 languages, whereas SOLR only supports a couple of languages with
morphological analysis. What happens when you need quality search support for
Russian, German, Spanish, Dutch, Danish, etc.?

One of the key values we bring is that we handle all these languages at the
highest linguistic quality standards. A development or product manager needs to
ask: do I want my highly skilled (and expensive) engineers integrating,
testing and supporting different pieces of linguistic software, developed by
different groups/individuals with varying degrees of quality, performance and
stability, when there is a commercial-ready platform to take these tasks off
their plate?

Are there any specialized or niche languages where you have an advantage?

It is true that Asian and Middle Eastern languages are more complex than
European languages and need high quality linguistic processing to get high
quality search results, so one might say this is our niche domain. But the fact
is that European languages such as French, Italian, Spanish and German are
highly inflected and require context-sensitive morphological analysis. We are
seeing very strong demand in Europe for our platform due to this fact. So my
answer is that we have an advantage across the board.

How is your software packaged? A simple jar with some config and dictionary
files?

The Rosette SDK also provides Entity Extraction which identifies entities such
as people, places and organizations for over 20 languages. Entity Extraction
can be used to implement what we refer to as “Discovery Search” features such
as faceted navigation, trending terms, etc… And like our Base Linguistics,
Entity Extraction also works out of the box with SOLR. Over the past year or so
we have seen a significant increase in demand for Entity Extraction. This is an
interesting topic and perhaps we can talk more about it in the near future. In
the interim here is a white paper that may be of interest:
http://collateral.basistech.net/Whitepapers/Entity-Extraction-Enables-Discovery-EN.pdf

Do you use some type of horrible license manager that makes distributed
installations a huge hassle?

Our licensing mechanisms are basically invisible to developers and do not
inhibit distributed installations in any manner. The licenses are not
restricted by CPUs or throughput. The software resides on our clients’
servers and never talks to Basis Technology. Our customers have complete
control over the software and are not restricted technically in any manner. As
we speak our software is installed and running on some of the largest
distributed platforms on the planet.

As you know, this interview is targeted at developers. If a coder would like to
“kick the tires”, how do they get started? What can they download and try? And
do you have config examples for Solr?

Yes, we try to make this as easy as possible. The request form is located here:
http://www.basistech.com/text-analytics/requests/evaluation-request.html

Yes, yes, I know developers don’t like forms. But the form only requires your
basic contact information, and the benefit is that we compile the software for
the developer’s platform of choice and provide email support for any questions
that arise. In order to provide the software, documentation and support we
simply need to know who we are sending our software to.

In regards to config examples for SOLR, yes absolutely. It is simply a matter
of making a few changes to Solr’s schema file. Our documentation provides
specific examples that illustrate the changes to the file schema.xml needed to
enable our analyzers. We also include sources for the Java code used to connect
Solr to our linguistic modules for power users that want to augment what our
Solr connector already does.

Andrew, if folks have read this far, then maybe you’ve piqued their interest.
So how much does this all cost? How many $0’s? And if I’m a startup company without
much cash, is there any point to even trying it?

Our pricing is based on how and where it is being deployed, meaning that
smaller companies pay significantly less for our software than, say, Microsoft,
Apple or Google. We work with plenty of small startups. See here:
http://www.basistech.com/customers

This is interesting stuff, thanks Andrew! For the techies out there, where can
they get more technical info? And presumably somebody would eventually need to
talk to a salesperson, what’s that link?

We are pretty open with our documentation and software. If readers are
interested they can start here:
http://www.basistech.com/resources

If you have specific questions, yes you might have to eventually talk to a
salesperson, but our sales people are very technical (not pushy) and will be
more than happy to put you in contact directly with one of our developers. And
after all, they are really nice people (including me).

This entry was posted in Interview, Technical Article by Mark Bennett.
