Re: [jira] [Commented] (UIMA-2128) Support to for gzipped XMI files

Marshall Schor Thu, 14 Jul 2011 19:32:58 -0700

There seem to be lots of zip implementations; 7-zip is another one, for
instance.  I haven't studied this issue enough to have a real opinion, but if
zip support is implemented, I wonder whether it would be good to have some kind
of pluggable mechanism to allow different compression formats for different
circumstances.
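
To make the "pluggable mechanism" idea concrete, here is a minimal sketch of
what such a plug-in point might look like.  The names StreamCodec and
CodecRegistry are hypothetical (nothing like them exists in UIMA as far as I
know), and only gzip from java.util.zip is wired in; other formats would be
registered the same way under their own suffix.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical plug-in point: one codec per compression format.
interface StreamCodec {
    OutputStream wrap(OutputStream raw) throws IOException;   // compress on write
    InputStream unwrap(InputStream raw) throws IOException;   // decompress on read
}

// Hypothetical registry that selects a codec by file-name suffix.
final class CodecRegistry {
    private final Map<String, StreamCodec> bySuffix = new LinkedHashMap<>();

    CodecRegistry() {
        // gzip is the only codec registered by default in this sketch.
        register(".gz", new StreamCodec() {
            @Override
            public OutputStream wrap(OutputStream raw) throws IOException {
                return new GZIPOutputStream(raw);
            }

            @Override
            public InputStream unwrap(InputStream raw) throws IOException {
                return new GZIPInputStream(raw);
            }
        });
    }

    // Other implementations can be plugged in under their own suffix.
    void register(String suffix, StreamCodec codec) {
        bySuffix.put(suffix, codec);
    }

    // Returns null when no codec matches, i.e. the file is treated as plain XMI.
    StreamCodec forName(String fileName) {
        for (Map.Entry<String, StreamCodec> e : bySuffix.entrySet()) {
            if (fileName.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    OutputStream openForWrite(String fileName, OutputStream raw) throws IOException {
        StreamCodec codec = forName(fileName);
        return codec == null ? raw : codec.wrap(raw);
    }

    InputStream openForRead(String fileName, InputStream raw) throws IOException {
        StreamCodec codec = forName(fileName);
        return codec == null ? raw : codec.unwrap(raw);
    }
}
```

With something like this, a caller would ask the registry for a stream based
on the file name (for example "doc.xmi.gz") and pass the result on to whatever
reads or writes the XMI.  I would expect that to be XmiCasSerializer.serialize
and XmiCasDeserializer.deserialize, but I have not tried that wiring.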


-Marshall

On 7/14/2011 4:32 PM, Greg Holmberg (JIRA) wrote:
>     [ 
> https://issues.apache.org/jira/browse/UIMA-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065516#comment-13065516
>  ] 
>
> Greg Holmberg commented on UIMA-2128:
> -------------------------------------
>
> I'm not sure where in the code it should be implemented (which seems to be 
> Jörn's point), but another technology option is EXI, a binary encoding for 
> XML.  See http://www.w3.org/XML/EXI 
>
> I've experimented with this to encode XMI between a UIMA process and a 
> database on a separate machine (both ends Java).  I used a commercial 
> implementation, Efficient XML from AgileDelta, and did some throughput 
> measurements, comparing it to gzipped XMI XML.  I found that EXI produces 
> somewhat smaller data than gzipped XML text (maybe 10% or 20% smaller, if I 
> remember correctly).  The biggest benefit to EXI was the amount of CPU time 
> required to read and write.  It was quite a bit faster than gzip to generate 
> the XML and parse the XML.  This is probably because it does so directly from 
> the ContentHandler, whereas with gzipped text, you first have to write the 
> text and then compress it.  Also, I suppose it's just more efficient to parse 
> a binary format than to step through characters looking for certain tokens.
>
> In my case, the improved throughput and reduction of CPU usage were most 
> important on the receiving end (i.e. the database) since it is a central 
> bottleneck in the overall landscape of my system.  As the number of UIMA 
> senders increases, the database reaches its limits to handle more messages 
> (no more CPU or NIC capacity on that machine).  So it was important to me to 
> be as efficient as possible with the XMI parsing on that machine in order to 
> minimize my hardware costs.
>
> In my case, all annotators are local, but a similar bottleneck situation 
> could arise in UIMA AS if you have a remote annotator (service).  Then, 
> making that UIMA processor as efficient as possible in terms of both CPU and 
> network bandwidth usage becomes important.  GZip will help a lot compared to 
> plain text, but EXI is even better: it mainly reduces the CPU cost of 
> generating and parsing XML, and it also saves some network bandwidth.
>
> Some open-source implementations of EXI are listed here: 
> http://en.wikipedia.org/wiki/Efficient_XML_Interchange
>
> I also tried the open-source Java implementation, EXIficient.  In early 2010 
> it implemented the standard technically correctly, but was immature, slow 
> (really, really slow!), and used a lot of memory.  However, that was about a 
> year ago, so it may have improved since.  I talked to the developers (from 
> Siemens) about their use of the GPL license, and they were not interested in 
> changing to an Apache-compatible license, so that may be an issue for use in 
> UIMA.
>
> I have not tried the other open-source Java implementation, OpenEXI. This 
> uses the Apache license though.  There's some discussion of gzipped XML text 
> versus EXI here: http://openexi.sourceforge.net/#whynotgzip 
>
> There's also an open-source C implementation, called EXIP.  I don't know 
> anything about it.
>
>
>> Support to for gzipped XMI files
>> --------------------------------
>>
>>                 Key: UIMA-2128
>>                 URL: https://issues.apache.org/jira/browse/UIMA-2128
>>             Project: UIMA
>>          Issue Type: Wish
>>          Components: CasEditor
>>            Reporter: Richard Eckart de Castilho
>>
>> Since XMI files tend to grow rather rapidly, it would be great if the CAS 
>> Editor supported reading and writing gzipped XMI files (.xmi.gz).
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>        
>
