http://searchhub.org/2013/07/12/commercial-vs-open-source-language-packs-an-interview-with-andrew-paulsen-regional-director-at-basis-technology/?goback=%2Egde_2474044_member_260003095

Commercial vs. Open Source Language Packs: An Interview with Andrew Paulsen, 
Regional Director at Basis Technology

Today we’re talking with Andrew Paulsen of Basis Technology about their 
commercial language packs. 

Since Basis first came to market, open source has made huge strides forward in 
supporting multiple languages. Not only does Solr support many European 
languages, but it also has multiple options for Japanese and Chinese, including 
morphological tokenization. But despite all this progress, Basis is still 
around selling their wares. 

Why would anybody pay for software when open source alternatives exist, 
especially when using an open source search engine? What are the advantages to 
the commercial packages? And how is it that Basis is still in business!? Andrew 
was gracious enough to endure our interrogation! 


Hi Andrew, thanks for chatting with us. First of all, can you tell us a little 
bit about Basis, and maybe also a bit about your background? 

Sure. Basis Technology has been around for over 18 years providing text 
analytics software to some of the world’s most successful and innovative 
software companies such as Adobe, EMC, Google, Microsoft, HP, Salesforce.com, 
Oracle, Symantec and Yahoo!… to name just a few. I have worked out of our San 
Francisco offices since 2000. Yes, you heard correctly, 12 years, which is a
long time in the software industry. Over the years I have helped numerous
software companies (including all the aforementioned) improve both their web
and enterprise search quality by implementing language-specific linguistic
support.


Let’s get right into it, and I assume you get this question a lot. Solr now 
comes with a fair number of language specific analyzers, so why are companies 
still willing to purchase language packs from commercial vendors? 

Yes, very good question. One would think that our business would be shrinking, 
but it is actually growing considerably. I think that is because software 
companies are getting more sophisticated and can derive more value out of high 
quality linguistic analysis. Also, I think that companies are focusing more on 
markets outside of the US for growth and are putting more effort and resources
into creating high quality support for search in foreign languages, as opposed
to just creating checkbox features.

To sum up our value proposition in relation to open source linguistics: we
provide higher quality, more in-depth features, a wider breadth of language
coverage and better performance/reliability. And as you know, software
engineers are expensive these days, especially search engineers with an NLP
background.
Companies can actually save money and increase development productivity by 
licensing a commercial ready NLP platform as opposed to having these well paid 
engineers implementing and testing various linguistic modules from around the 
world with various levels of quality and performance. 


For Western languages, one of the things Basis enables is lemmatization vs. 
simple stemming. This was something that FAST ESP used to talk about a lot, 
before being acquired by Microsoft. Can you tell us more about this, and maybe 
why it matters? Any examples? 

Sure… As a side note, FAST was one of our customers. When Microsoft acquired
FAST, many independent software vendors who licensed FAST moved over to
Solr/Lucene. These former FAST customers have high expectations for non-English
search, and so many of those customers licensed our software to integrate with
their Solr implementations.

With regard to the benefits of lemmatization versus stemming: stemming uses
context-insensitive algorithms that normalize tokens by chopping off the ends
of words based upon rules. The resulting stem is often not an actual word, but
an artifact of what used to be a word.

Example:
The word “babies” stems to “babie”
The word “copying” stems to “copi”

Lemmatization, on the other hand, is context sensitive and normalizes a word to
its true dictionary form.

Example:
The word “babies” lemmatizes to “baby”
The word “copying” lemmatizes to “copy”

Stemming can create a lot of problems, such as different words creating the
same stem or the same word creating different stems. There are also many
instances where stemming flat out fails and does nothing to try and normalize
the word. These types of problems and failures can create havoc for a company
trying to provide high quality search.
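
The contrast described above can be sketched in a few lines of Python. This is
a toy illustration only: the suffix rules in `toy_stem` and the tiny
`LEMMAS` lexicon are assumptions invented for this sketch, and they do not
reproduce the behavior of Solr's stemmers or of Rosette. The stemmer chops
endings by rule with no knowledge of the word, while the lemmatizer looks up a
dictionary form using the part of speech as context.

```python
# Toy illustration only: the rules and lexicon below are assumptions made for
# this sketch, not the behavior of Solr's stemmers or of Rosette.

def toy_stem(word: str) -> str:
    """Context-insensitive suffix stripping: chop endings by rule."""
    w = word.lower()
    if w.endswith("ing") and len(w) > 5:
        w = w[:-3]                      # "copying" -> "copy"
    elif w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        w = w[:-1]                      # "babies" -> "babie"
    if w.endswith("y") and len(w) > 2:
        w = w[:-1] + "i"                # "copy" -> "copi"
    return w


# Hypothetical lexicon keyed by (surface form, part of speech).
LEMMAS = {
    ("babies", "NOUN"): "baby",
    ("copying", "VERB"): "copy",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Context-sensitive lookup: the part of speech selects the lemma."""
    return LEMMAS.get((word.lower(), pos), word.lower())


if __name__ == "__main__":
    for w in ("babies", "baby", "copying", "news", "new"):
        print(w, "->", toy_stem(w))
    print(toy_lemmatize("saw", "VERB"), toy_lemmatize("saw", "NOUN"))
```

With these toy rules, “babies” and “baby” end up with different stems (“babie”
and “babi”), while the unrelated words “news” and “new” collapse to the same
stem. The lemmatizer, given a part of speech, maps “saw” to “see” as a verb
and leaves it as “saw” as a noun.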

Stemming was originally developed to support English search and although there 
are some problems with English stemming, it generally works well for keyword 
search applications. The real problems come into play with European languages. 

European languages are highly inflected compared with English, meaning that,
depending on the context, the same word can be written many ways. Do you
remember trying to conjugate Spanish verbs in high school, or dealing with
masculine and feminine forms? The more inflected a language is, the more
important it is to use morphological analysis to produce the correct lemma for
indexing.

Here is a whitepaper with extensive examples and explanations:
http://www.basistech.com/whitepapers/Enabling-High-Quality-Search-in-European-Languages-EN.pdf

When indexing text, the first step is to break it up into words. For Asian 
languages this is tough because they often don’t have spaces between words. In 
the early days of Solr, there was the primitive “ngram” based CJK module, which 
just chopped text into character pairs and had no concept of linguistic 
integrity. Obviously Basis was superior to that. But in recent years 
morphologic analyzers have been added for Chinese and Japanese, and it looks 
like Korean is in the works. How will you compete with these? 
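
The character-pair approach mentioned in this question can be sketched as
follows. This is a minimal illustration in Python, not Solr's actual CJK code;
the function name `char_bigrams` and the sample string are assumptions chosen
for the example.

```python
from typing import List


def char_bigrams(text: str) -> List[str]:
    """Split text into overlapping two-character tokens, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # Zero or one character: nothing to pair up.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


if __name__ == "__main__":
    # "東京都庁" is the Tokyo Metropolitan Government office.
    print(char_bigrams("東京都庁"))
```

For the sample string the tokens are 東京, 京都 and 都庁. The middle token, 京都,
is the word for Kyoto, so a query for Kyoto would match a document about
Tokyo. That is the kind of false match that comes from having no concept of
word boundaries.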

Agreed, the open source analyzers have certainly gotten better, but so has our
technology. For example, the Solr Chinese analyzer only supports Simplified
Chinese and does not support Traditional Chinese. This is a deal killer for
many companies, since supporting Traditional Chinese is often a requirement.

Also, we provide more in-depth features for companies that are serious about
providing high quality search, such as a user dictionary, a de-compounding
option, part of speech tagging, base noun phrase extraction, and for Chinese we
also handle pinyin readings… We also provide various knobs and dials to fine
tune the text processing to meet a company’s unique search requirements. These
sets of features, which we refer to as Base Linguistics, are available across
all the languages that Basis Technology’s Rosette supports.

We’ve taken a look at the Solr Japanese analyzer and it has a good set of
features. It was created by a former FAST employee who is well respected in
this space. Since the Solr module for Korean isn’t available yet, I cannot
comment on that technology.

But the real takeaway is this: there are about 20-25 commercially significant
markets by language in the world. We currently provide linguistic analysis for
over 40 languages, whereas Solr only supports a couple of languages with
morphological analysis. What happens when you need quality search support for
Russian, German, Spanish, Dutch, Danish, etc.?

One of the key values we bring is that we handle all these languages at the
highest linguistic quality standards. A Development or Product manager needs to
ask himself, do I want my highly skilled (and expensive) engineers integrating,
testing and supporting different pieces of linguistic software, developed by
different groups/individuals with varying degrees of quality, performance and
stability, when there is a commercial ready platform to take these tasks off
their plate?


Are there any specialized or niche languages where you have an advantage? 

It is true that Asian and Middle Eastern languages are more complex than
European languages and need high quality linguistic processing to get high 
quality search results, so one might say this is our niche domain. But the fact 
is that European languages such as French, Italian, Spanish and German are 
highly inflected and require context sensitive morphological analysis. We are 
seeing very strong demand in Europe for our platform due to this fact. So my 
answer is that we have an advantage across the board. 


How is your software packaged? A simple jar with some config and dictionary 
files? 

The Rosette SDK also provides Entity Extraction which identifies entities such 
as people, places and organizations for over 20 languages. Entity Extraction 
can be used to implement what we refer to as “Discovery Search” features such 
as faceted navigation, trending terms, etc… And like our Base Linguistics,
Entity Extraction also works out of the box with Solr. Over the past year or so
we have seen a significant increase in demand for Entity Extraction. This is an 
interesting topic and perhaps we can talk more about it in the near future. In 
the interim here is a white paper that may be of interest:
http://collateral.basistech.net/Whitepapers/Entity-Extraction-Enables-Discovery-EN.pdf

Do you use some type of horrible license manager that makes distributed 
installations a huge hassle? 

Our licensing mechanisms are basically invisible to developers and do not 
inhibit distributed installations in any manner. The licenses are not
restricted by CPUs or throughput. The software resides on our clients’
servers and never talks to Basis Technology. Our customers have complete 
control over the software and are not restricted technically in any manner. As 
we speak our software is installed and running on some of the largest 
distributed platforms on the planet. 


As you know, this interview is targeted at developers. If a coder would like to 
“kick the tires”, how do they get started? What can they download and try? And 
do you have config examples for Solr? 

Yes, we try to make this as easy as possible. The request form is located here: 
http://www.basistech.com/text-analytics/requests/evaluation-request.html 

Yes, yes, I know developers don’t like forms. But the form only requires your
basic contact information, and the benefit is that we compile the software for
the developer’s platform of choice and provide email support for any questions
that arise. In order to provide the software, documentation and support, we
simply need to know who we are sending our software to.

With regard to config examples for Solr, yes, absolutely. It is simply a matter
of making a few changes to Solr’s schema file. Our documentation provides
specific examples that illustrate the changes to the file schema.xml needed to
enable our analyzers. We also include sources for the Java code used to connect
Solr to our linguistic modules, for power users who want to augment what our
Solr connector already does.


Andrew, if folks have read this far, then maybe you’ve piqued their interest. 
So how much does this all cost? How many $0’s? And if I’m a startup company
without much cash, is there any point to even trying it?

Our pricing is based on how and where it is being deployed, meaning that 
smaller companies pay significantly less for our software than, say, Microsoft,
Apple or Google. We work with plenty of small startups. See here:
http://www.basistech.com/customers

This is interesting stuff, thanks Andrew! For the techies out there, where can
they get more technical info? And presumably somebody would eventually need to
talk to a salesperson; what’s that link?

We are pretty open with our documentation and software. If readers are 
interested they can start here:
http://www.basistech.com/resources
If you have specific questions, yes, you might eventually have to talk to a
salesperson, but our salespeople are very technical (not pushy) and will be
more than happy to put you in contact directly with one of our developers. And
after all, they are really nice people (including me).

This entry was posted in Interview, Technical Article by Mark Bennett.
