[jira] [Updated] (ODFTOOLKIT-308) GSoC: ODF Command Line Tools

Rob Weir (Updated) (JIRA) Mon, 19 Mar 2012 14:30:06 -0700

     [ 
https://issues.apache.org/jira/browse/ODFTOOLKIT-308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rob Weir updated ODFTOOLKIT-308:
--------------------------------

    Description: 
==Background on our open source project==

The Apache ODF Toolkit is a set of Java modules that allow programmatic 
creation, scanning and manipulation of Open Document Format (ODF, ISO/IEC 
26300) documents. Unlike other approaches, which rely on runtime manipulation 
of heavy-weight editors via an automation interface, the ODF Toolkit is 
lightweight and ideal for server use. 

http://incubator.apache.org/odftoolkit/index.html

==The Idea==

GNU/Linux, and UNIX before it, have shown the great power of text processing 
via simple command line tools, combined with operating system facilities for 
piping and redirection. This filter-based text processing is what makes shell 
programming so powerful.  But it only works well for pure text documents.  
What about more complex WYSIWYG documents, such as spreadsheets and word 
processor files, with more complex formats?  There the existing tool set 
becomes far weaker.

The Apache ODF Toolkit is a Java API that gives a high level view of a 
document, and enables programmatic manipulation of a document.  We have 
functions for doing things like search & replace, adding paragraphs, accessing 
cells in a spreadsheet, etc., all from a Java application.  No traditional 
editor is involved.  It is pure Java, something you could even run on a server.

You can look at our "cookbook" for examples of our "Simple API" in action:

http://incubator.apache.org/odftoolkit/simple/document/cookbook/index.html


There is a lot you can do using this API.  But it still requires Java 
programming, and that limits its reach to professional programmers.

What if we could write, using the ODF Toolkit, a set of command line utilities 
that made it easy to do both simple and complex text manipulation tasks from a 
command line, things like:

1) Concatenate documents
2) Replace slide 3 in presentation A with slide 3 from presentation B
3) Apply the styles of document A to all documents in the current directory
4) Find all occurrences of "sausages" in the given document and add a hyperlink 
to sausages.com

and so on.
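
As a rough illustration of the "find" half of task 4, here is a minimal sketch 
of a grep-like search over the paragraphs of an ODF content.xml. It is only a 
sketch under stated assumptions: it uses the JDK's XML parser rather than the 
ODF Toolkit API, the class name OdfGrepSketch and the sample XML are made up 
for illustration, and a real tool would first read content.xml out of the ODF 
package and would probably rely on the toolkit's own search functions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: grep-like search over the paragraphs of an ODF content.xml.
// Uses only the JDK, not the ODF Toolkit API.
class OdfGrepSketch {
    // Namespace URI of the ODF "text" vocabulary (text:p, text:h, ...).
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    // Returns the text of every text:p paragraph that contains the search word.
    static List<String> grep(String contentXml, String word) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getElementsByTagNameNS
            doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(contentXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse content.xml", e);
        }
        NodeList paragraphs = doc.getElementsByTagNameNS(TEXT_NS, "p");
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < paragraphs.getLength(); i++) {
            String text = paragraphs.item(i).getTextContent();
            if (text.contains(word)) {
                hits.add(text);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for a real content.xml.
        String xml = "<office:document-content"
                + " xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\""
                + " xmlns:text=\"" + TEXT_NS + "\">"
                + "<office:body><office:text>"
                + "<text:p>We sell sausages.</text:p>"
                + "<text:p>We also sell bread.</text:p>"
                + "</office:text></office:body></office:document-content>";
        for (String hit : grep(xml, "sausages")) {
            System.out.println(hit);
        }
    }
}
```

Printing one matching paragraph per line keeps the output plain text, so it 
could be fed straight into sort, uniq or wc like any other filter output.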

The audience for such a tool could be:

1) Data wranglers, who want to extract information from a large number of ODF 
documents. 

2) Power users who want to automate repetitive document tasks, like filling 
in a template or an off-line mail merge

3) QA testers of office editors, who use simple scripts to generate test cases 
as well as to test editor-generated documents for correctness

4) Web developers who want to generate a data-driven document on-the-fly 

So think generally in that space. Not system programmers.  Not application 
developers.  But command line gurus, with a little scripting ability at most.  
That is the  "sweet spot".

Some technical aspects you might want to consider:

1)    The real value of the Unix text utilities is that they can easily be 
combined. For example, I recently did this to search for all openoffice.org 
email addresses in a downloaded copy of the openoffice website, deduping and 
sorting by how many times each address appeared:


grep -o -r -i --no-filename --include="*.html" \
  "[[:alnum:]+\.\_\-]*@openoffice.org" . | sort | uniq -c | sort -n -r

So, powerful command line tools that each do one thing well. And then a way to 
pipe the outputs of one to become the inputs of another.   Can we define a 
similar set of basic operations on ODF documents, as well as the glue to 
combine these commands into more powerful pipelines?


2) Useful example tools are cat, grep, diff and sed. Maybe even something 
awk-like that works with spreadsheets?  No need to be slavish to the original 
tools; the goal is something of similar power that operates on ODF 
documents.

3)  The trick will be that an ODF document is a ZIP file containing multiple 
XML files, and possibly other resources, like JPG images. If we pipe the binary 
ZIP, then we're forcing each tool in the chain to do the uncompress/compress, 
which is bad for performance. There is also the issue of repeated 
parsing/serialization of the XML.  So how can we do this all efficiently?  
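
One possible direction for the efficiency question in point 3, sketched under 
assumptions rather than proposed as the answer: unzip the package and parse 
content.xml once, run every step of the chain against the same in-memory DOM, 
then serialize and zip once at the end. The names below (OdfPipelineSketch, 
unpack, pack, run) are hypothetical, only JDK classes are used, and the sample 
package is a toy stand-in with a simplified content.xml, not a valid ODF 
document. The sketch also ignores the ODF packaging rule that the mimetype 
entry must come first and be stored uncompressed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical sketch: unpack and parse once, chain steps in memory, repack once.
class OdfPipelineSketch {
    // Reads every entry of a ZIP package into memory, keyed by entry name.
    static Map<String, byte[]> unpack(byte[] zipBytes) throws Exception {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry e = in.getNextEntry(); e != null; e = in.getNextEntry()) {
                entries.put(e.getName(), in.readAllBytes());
            }
        }
        return entries;
    }

    // Writes the entries back into a ZIP, in map order (all entries deflated).
    static byte[] pack(Map<String, byte[]> entries) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, byte[]> e : entries.entrySet()) {
                out.putNextEntry(new ZipEntry(e.getKey()));
                out.write(e.getValue());
                out.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    // Runs all steps against one parsed DOM of content.xml, then serializes once.
    static byte[] run(byte[] odfBytes, List<Consumer<Document>> steps) throws Exception {
        Map<String, byte[]> entries = unpack(odfBytes);
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document content = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(entries.get("content.xml")));
        for (Consumer<Document> step : steps) {
            step.accept(content);
        }
        ByteArrayOutputStream xml = new ByteArrayOutputStream();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(content), new StreamResult(xml));
        entries.put("content.xml", xml.toByteArray()); // keeps its original position
        return pack(entries);
    }

    public static void main(String[] args) throws Exception {
        // Toy package: a simplified content.xml, not a valid ODF document.
        Map<String, byte[]> demo = new LinkedHashMap<>();
        demo.put("mimetype", "application/vnd.oasis.opendocument.text"
                .getBytes(StandardCharsets.US_ASCII));
        demo.put("content.xml", "<doc><p>sausages</p></doc>"
                .getBytes(StandardCharsets.UTF_8));

        // Two chained steps: a sed-like replacement, then a cat-like append.
        List<Consumer<Document>> steps = List.of(
                d -> d.getElementsByTagName("p").item(0).setTextContent("bratwurst"),
                d -> d.getDocumentElement()
                        .appendChild(d.createElementNS(null, "p"))
                        .setTextContent("added"));

        byte[] result = run(pack(demo), steps);
        System.out.println(new String(unpack(result).get("content.xml"),
                StandardCharsets.UTF_8));
    }
}
```

The trade-off is that the steps run inside one JVM process instead of being 
separate programs joined by a shell pipe, so this shape fits a single driver 
command or a dedicated shell better than a set of independent executables.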


Note:  These are just ideas to get you thinking in this general area. I would 
be pleased to review any GSoC proposals related to the ODF Toolkit.

  was:
GNU/Linux, and UNIX before it, have shown the great power of text processing 
via simple command line tools, combined with operating system facilities for 
piping and redirection. This filter-based text processing is what makes shell 
programming so powerful.  But it only works well for text documents.  But what 
about more complex, WYSIWYG documents, spreadsheets, word processors, with more 
complex formats, often not text based at all?  The tool set becomes far weaker.

The Apache ODF Toolkit is a Java API that gives a high level view of a 
document, and enables programmatic manipulation of a document.  We have 
functions for doing things like search & replace.  There is a lot you can do 
using the ODF Toolkit.  But it still requires Java programming, and that limits 
its reach to professional programmers.

What if we could write, using the ODF Toolkit, a set of command line utilities 
that made it easy to do both simple and complex text manipulation tasks from a 
command line, things like:

1) Concatenate documents
2) Replace slide 3 in presentation A with slide 3 from presentation B
3) Apply the styles of document A to all documents in the current directory
4) Find all occurrences of "sausages" in the given document and add a hyperlink 
to sausages.com

and so on.

Clearly analogs of cat, grep, diff and sed are obvious ones. Maybe something 
awk-like that works with spreadsheets?  No need to be slavish to the original 
tools, but create something of similar power, but which operate on ODF 
documents.  For example, an alternative solution might be to write a new shell 
processor that has native commands for ODF document manipulation.

    
> GSoC:  ODF Command Line Tools
> -----------------------------
>
>                 Key: ODFTOOLKIT-308
>                 URL: https://issues.apache.org/jira/browse/ODFTOOLKIT-308
>             Project: ODF Toolkit
>          Issue Type: New Feature
>            Reporter: Rob Weir
>            Assignee: Rob Weir
>              Labels: gsoc2012, mentor
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
