Isn't gzip infamous for being slow? HBase uses LZO for this reason. Apache Thrift
also uses something else.

I can recommend using UIMA-AS and letting it process data stored in HBase. We run a couple of UIMA-AS services on the same machines that host Hadoop and got good results. Since we are using OpenNLP, though, the bottleneck we hit is CPU power.

The first optimization I would make is to get data locality with UIMA-AS; then the
network bottleneck vanishes.

What kind of analysis do you run? Is it also CPU intensive?

Jörn

On 7/14/11 10:32 PM, Greg Holmberg (JIRA) wrote:
     [ https://issues.apache.org/jira/browse/UIMA-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065516#comment-13065516 ]

Greg Holmberg commented on UIMA-2128:
-------------------------------------

I'm not sure where in the code it should be implemented (which seems to be 
Jörn's point), but another technology option is EXI, a binary encoding for XML. 
 See http://www.w3.org/XML/EXI

I've experimented with this to encode XMI between a UIMA process and a database 
on a separate machine (both ends Java).  I used a commercial implementation, 
Efficient XML from AgileDelta, and did some throughput measurements, comparing 
it to gzipped XMI XML.  I found that EXI produces somewhat smaller data than 
gzipped XML text (maybe 10% or 20% smaller, if I remember correctly).  The 
biggest benefit to EXI was the amount of CPU time required to read and write.  
It was quite a bit faster than gzip to generate the XML and parse the XML.  
This is probably because it does so directly from the ContentHandler, whereas 
with gzipped text, you first have to write the text and then compress it.  
Also, I suppose it's just more efficient to parse a binary format than to step 
through characters looking for certain tokens.

In my case, the improved throughput and reduced CPU usage were most 
important on the receiving end (i.e. the database), since it is a central 
bottleneck in the overall landscape of my system.  As the number of UIMA 
senders increases, the database reaches its limits to handle more messages (no 
more CPU or NIC capacity on that machine).  So it was important to me to be as 
efficient as possible with the XMI parsing on that machine in order to minimize 
my hardware costs.

In my case, all annotators are local, but a similar bottleneck situation could 
arise in UIMA AS if you have a remote annotator (service).  Then, making that 
UIMA processor as efficient as possible in terms of both CPU and network 
bandwidth usage becomes important.  Gzip will help a lot compared to plain 
text, but EXI is even better, especially for reducing the CPU usage of XML 
generation and parsing, but also somewhat for network bandwidth.

Some open-source implementations of EXI are listed here: 
http://en.wikipedia.org/wiki/Efficient_XML_Interchange

I also tried the open-source Java implementation, EXIficient.  In early 2010 it 
implemented the standard technically correctly, but was immature, slow (really, 
really slow!), and used a lot of memory.  However, that was about a year ago, so 
maybe it's improved since then.  I talked to the developers (from Siemens) 
about their use of the GPL license, and they were not interested in changing to 
an Apache-compatible license, so that may be an issue for use in UIMA.

I have not tried the other open-source Java implementation, OpenEXI.  It uses 
the Apache license, though.  There's some discussion of gzipped XML text versus 
EXI here: http://openexi.sourceforge.net/#whynotgzip

There's also an open-source C implementation, called EXIP.  I don't know 
anything about it.


Support for gzipped XMI files
-----------------------------

                 Key: UIMA-2128
                 URL: https://issues.apache.org/jira/browse/UIMA-2128
             Project: UIMA
          Issue Type: Wish
          Components: CasEditor
            Reporter: Richard Eckart de Castilho

Since XMI files tend to grow rather rapidly, it would be great if the CAS 
Editor supported reading and writing gzipped XMI files (.xmi.gz).
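
To illustrate what reading and writing a .xmi.gz file could involve, here is a minimal sketch using only the JDK's java.util.zip streams. The class name GzipXmiSketch, its two helper methods, and the placeholder XML string are assumptions made for illustration; this is not code from the CAS Editor, and it needs Java 9 or later for readAllBytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of UIMA: round-trips XML text through a gzip file.
class GzipXmiSketch {

    // Wraps the file stream in a GZIPOutputStream, so bytes are compressed as they are written.
    static void writeGzipped(Path file, String xml) throws IOException {
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(file))) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
    }

    // Wraps the file stream in a GZIPInputStream and decodes the decompressed bytes as UTF-8.
    static String readGzipped(Path file) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(file))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("example", ".xmi.gz");
        // Placeholder content standing in for a real serialized CAS.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<xmi:XMI xmlns:xmi=\"http://www.omg.org/XMI\"/>";
        writeGzipped(file, xml);
        System.out.println("round trip ok: " + readGzipped(file).equals(xml));
        Files.delete(file);
    }
}
```

In a UIMA setting, the same wrapped streams could presumably be handed to XmiCasSerializer.serialize(cas, out) and XmiCasDeserializer.deserialize(in, cas) instead of going through a String, so the XML is never held in memory uncompressed.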
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


