Rainer Klute wrote:

On Wed, 2004-06-23 at 14:22, Sergiu Gordea wrote:


Thanks for the advice, but if you are talking about the code listed below, I can tell you it is not working:



Before looking at the code in detail: What do you mean by "is not
working"?



Hi Rainer,

 I revised the code and modified it a little.

One problem was that the FileOutputStream was not closed, so nothing was
written to the file.
The second was that I was not printing the array of bytes correctly.
Some problems are still unsolved, though:

1. I used this in my code:
    if(!event.getName().equalsIgnoreCase("PowerPoint Document"))
              return;

   because I want to read only the text in the document, and
I'm not interested in the other elements:
    drawings, images, headers, etc.

    I don't like the hard-coded "PowerPoint Document". Does anyone
know how I can replace this condition with something more general?

2. I have a master template and therefore (I think) I get something
like this in the resulting text:

  Klicken Sie, um die Formate des Vorlagentextes zu bearbeiten
Zweite Ebene djflk sdlkfj sdlfkj sdlkfj sdlkfj sdlfkjs dflskjdf slkjd fslk
Dritte Ebene djflk sdlkfj sdlfkj sdlkfj sdlkfj sdlfkjs dflskjdf slkjd
fslk h fkjh kjh dskjh dkjh
Vierte Ebene
Detail 1
Detail 2
Detail 3   Titel   INFORMATICS
DEPARTMENTS   University of Klagenfurt   Klicken Sie, um das Titelformat
zu bearbeiten   Klicken Sie, um das Untertitelformat zu bearbeiten
Textmasterformate durch Klicken bearbeiten
Zweite Ebene
Dritte Ebene
Vierte Ebene

3. Special characters are not extracted correctly, and CRs should also be
removed from the resulting text:
ü (fünfte)
F?nfte Ebene   PROM Server   PROM Plug-in
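
A possible direction for points 2 and 3, sketched in plain Java without any POI calls. It assumes the stream is a sequence of records with an 8-byte little-endian header (version/instance, type, length), that a record is a container when the low four bits of its first header word are 0xF, that type 4008 holds one byte per character, that type 4000 holds UTF-16LE text, and that the master template sits in a container of type 1016. The names PptTextSketch, collectText, appendText and record are made up for illustration, and record only builds sample data for the demo.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class PptTextSketch {
    static final int HEADER_LEN = 8;          // 2 bytes version/instance, 2 bytes type, 4 bytes length
    static final int MAIN_MASTER = 1016;      // assumed: container holding the master template
    static final int TEXT_CHARS_ATOM = 4000;  // assumed: text stored as UTF-16LE
    static final int TEXT_BYTES_ATOM = 4008;  // assumed: text stored as one byte per character

    /** Walks the records in data[start, end) and appends the text found outside master containers. */
    static void collectText(byte[] data, int start, int end, StringBuilder out) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int pos = start;
        while (pos + HEADER_LEN <= end) {
            int verInst = buf.getShort(pos) & 0xFFFF;
            int type = buf.getShort(pos + 2) & 0xFFFF;
            long len = buf.getInt(pos + 4) & 0xFFFFFFFFL;
            int bodyStart = pos + HEADER_LEN;
            if (len > end - bodyStart) {
                break; // length runs past the enclosing range: stop rather than guess
            }
            int bodyLen = (int) len;
            if (type == TEXT_BYTES_ATOM) {
                // ISO-8859-1 maps each byte to the code point of the same value, so 0xFC becomes 'ü'
                appendText(out, new String(data, bodyStart, bodyLen, StandardCharsets.ISO_8859_1));
            } else if (type == TEXT_CHARS_ATOM) {
                appendText(out, new String(data, bodyStart, bodyLen, StandardCharsets.UTF_16LE));
            } else if ((verInst & 0x0F) == 0x0F && type != MAIN_MASTER) {
                // a container other than the master: descend into its children
                collectText(data, bodyStart, bodyStart + bodyLen, out);
            }
            pos = bodyStart + bodyLen;
        }
    }

    /** Replaces CR and vertical tab with newlines and ends each text run with a newline. */
    static void appendText(StringBuilder out, String text) {
        out.append(text.replace('\r', '\n').replace((char) 0x0B, '\n')).append('\n');
    }

    /** Demo helper (hypothetical): builds one record from a header and its body parts. */
    static byte[] record(int verInst, int type, byte[]... parts) {
        int len = 0;
        for (byte[] p : parts) {
            len += p.length;
        }
        ByteBuffer buf = ByteBuffer.allocate(HEADER_LEN + len).order(ByteOrder.LITTLE_ENDIAN);
        buf.putShort((short) verInst).putShort((short) type).putInt(len);
        for (byte[] p : parts) {
            buf.put(p);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] master = record(0x0F, MAIN_MASTER,
                record(0, TEXT_BYTES_ATOM, "Titel".getBytes(StandardCharsets.ISO_8859_1)));
        byte[] slide = record(0x0F, 1006,
                record(0, TEXT_BYTES_ATOM, "f\u00FCnfte\rEbene".getBytes(StandardCharsets.ISO_8859_1)),
                record(0, TEXT_CHARS_ATOM, "PROM Server".getBytes(StandardCharsets.UTF_16LE)));
        byte[] stream = record(0x0F, 1000, master, slide);

        StringBuilder out = new StringBuilder();
        collectText(stream, 0, stream.length, out);
        System.out.print(out);
    }
}
```

With a real file, the bytes read from event.getStream() would be passed in as data. Whether skipping type 1016 removes all of the template text in a real presentation is not tested here.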


I'm not at all familiar with the format of MS Office documents and I have only just downloaded the POI library, but I would like to create some classes to extract the text from Word, PPT and XLS documents.

I found an example and have already created a class to get the text from XLS
documents.
Is anyone interested in helping me create these text extractors, so that
we can provide some
helper classes that can be used together with Lucene to index Office
documents?

 Thanks for understanding,

 Sergiu Gordea

PS: Please excuse the mess in the attached source code; it's
just a test class.







Best regards
Rainer Klute

                          Rainer Klute IT-Consulting GmbH
 Dipl.-Inform.
 Rainer Klute             E-Mail:  [EMAIL PROTECTED]
 Körner Grund 24          Telefon: +49 172 2324824
D-44143 Dortmund           Telefax: +49 231 5349423


--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]





Thu Jun 24 13:52:57 CEST 2004Knowledge Management in Software 
EngineeringAutomatically Knowledge Acquisition about 
Software Development Process  


Sergiu Gordea                                              Klagenfurt 17. May. 
2004Data, Information, KnowledgeData:
- consists of discrete, objective facts about events 
- essential raw material for the creation of information
Information:
- a message, usually in the form of document or in audible/visible communication; it 
has a sender and a receiver 
- data + meaning 
Knowledge:
- is broader than information, it is information combined with experience, 
interpretation and reflection
- should contain information about information, like classification and meta-data 
(creator, when and where, context etc. )
Knowledge Life-CycleProblems of  KM in SETacit Knowledge:
 expert networks
 expensive 
 requires friendly environment for knowledge sharing
 lack of motivation in Software Development Companies (SDC)
Explicit Knowledge:
 the creation of explicit knowledge is not on the list of priorities in SDC
 each SDC manages their knowledge somehow
 it is hard to prove the outcome obtained from the usage of Knowledge Management System

Solution: Automatically Knowledge Acquisition ! Knowledge about SDP 
 current status of the development process
 post mortem analysis of the projects
 the quality of developed software


 new information and artifacts created during the development process Measures of 
SDP  System Architecture Fig.2 System Architecture
System Modules
System ModulesProm Trace:
 registers the time spent and the title of the active window
WebMetrics:
 WebMetrics is an extensible tool able to extract code metrics from source code files
Code metrics: Lines Of Code (LOC), McCabe Cyclomatic Complexity, Halstead Volume, 
Fan-In and Fan-Out
OOM: Weighted Methods per Class (WMC), Depth of Inheritance Tree (DIT), Number Of 
Children (NOC), Coupling Between Object Classes (CBO), Response For a Class (RFC), 
Lack Of Cohesion in Methods (LCOM) 


Open Issues
 discover and eliminate the noise from the data collected by PROM Tool 
 find correlation between the collected data and the software metrics
 draw conclusions and generate knowledge automatically (or semi-automatically)
 complete the set of collected metrics with other relevant metrics (ex. Design & 
Quality)
 discover/predict modules with high fault density
 discover modules with bad design
 localize the parts of code with bad design Klicken Sie, um die Formate des 
Vorlagentextes zu bearbeiten
Zweite Ebene djflk sdlkfj sdlfkj sdlkfj sdlkfj sdlfkjs dflskjdf slkjd fslk 
Dritte Ebene djflk sdlkfj sdlfkj sdlkfj sdlkfj sdlfkjs dflskjdf slkjd fslk h fkjh kjh 
dskjh dkjh 
Vierte Ebene
Detail 1
Detail 2
Detail 3TitelINFORMATICS
DEPARTMENTSUniversity of KlagenfurtKlicken Sie, um das Titelformat zu 
bearbeitenKlicken Sie, um das Untertitelformat zu bearbeitenTextmasterformate 
durch Klicken bearbeiten
Zweite Ebene
Dritte Ebene
Vierte Ebene
F?nfte EbenePROM ServerPROM Plug-inPROM TraceWeb MetricsReport 
GenerationProm Plug-in:
 registers the time spent by developers in each method 
Available for JBuilder, Eclipse, IDEA, MS Office, OpenOffice 
/* @(#) CWK 1.5 21.06.2004
 * 
 * Copyright 2003-2005 ConfigWorks Informationssysteme & Consulting GmbH
 * Universitätsstr. 94/7 9020 Klagenfurt Austria
 * www.configworks.com
 * All rights reserved.
 */
package com.configworks.cwk.be.search.converters;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Date;
import org.apache.poi.hpsf.PropertySet;
import org.apache.poi.hpsf.PropertySetFactory;
import org.apache.poi.poifs.eventfilesystem.POIFSReader;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderEvent;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderListener;
import org.apache.poi.util.LittleEndian;


public class Ppt2Txt
{
        public static void main(String[] args)
        throws IOException{
                //String filename = args[0];
                File src = new File("e:\\test.ppt");
                File dest = new File("e:\\ppt.txt");
                
                convertPpt(src, dest);
        }

        public static void convertPpt(File src, File dest) throws IOException, FileNotFoundException {
                POIFSReader r = new POIFSReader();

                /* Register a listener for *all* documents. */
                MyPOIFSReaderListener listener = new MyPOIFSReaderListener(
                                new BufferedWriter(new FileWriter(dest)));

                r.registerListener(listener);
                r.read(new FileInputStream(src));
        }
        
        
}


         class MyPOIFSReaderListener implements POIFSReaderListener{
                private BufferedWriter writer = null;
                
                public MyPOIFSReaderListener(BufferedWriter writer){
                        this.writer = writer;
                }

                public void processPOIFSReaderEvent(POIFSReaderEvent event) {
                        PropertySet ps = null;

                        try{
                                
                                org.apache.poi.poifs.filesystem.DocumentInputStream dis=null;
                                if(!event.getName().equalsIgnoreCase("PowerPoint Document"))
                                        return;
                                
                                System.out.println("\n\n");
                                System.out.println(event.getPath()+event.getName());
                                dis=event.getStream();
                                
                                 byte btoWrite[]= new byte[12];
                                 dis.read(btoWrite);
//                               System.out.println("Version:"+LittleEndian.getUnsignedByte(btoWrite,0));
//                               System.out.println("Instance:"+LittleEndian.getUShort(btoWrite,0));
//                               System.out.println("Type:"+LittleEndian.getUShort(btoWrite,2));
//                               System.out.println("Len:"+LittleEndian.getLong(btoWrite,4));

                                
                                btoWrite = new byte[dis.available()];
                                dis.read(btoWrite, 0, dis.available());
                                
                                String date = (new Date()).toString();
                                System.out.println(date);
                                writer.write(date);
                                
                                StringBuffer buff = new StringBuffer("");
                                
                                for(int i=0; i<btoWrite.length-20; i++){
                                        //System.out.println("Version :"+LittleEndian.getUnsignedByte(btoWrite,i+0));
                                        //System.out.println("Instance :"+LittleEndian.getUShort(btoWrite,i+0));
                                        //System.out.println("Type:"+LittleEndian.getUShort(btoWrite,i+2));
                                        //System.out.println("Len:"+LittleEndian.getUInt(btoWrite,i+4));

                                        long type=LittleEndian.getUShort(btoWrite,i+2);
                                        long size=LittleEndian.getUInt(btoWrite,i+4);
                                        if (type==4008){
                                                
                                                
//fos.write(btoWrite,i+4+1,(int)size+3);
                                                int offset = i+4+1;
                                                int length = (int)size+3;
                                                int end = offset + length;
                                                
                                                byte[] textBytes = new byte[length]; 
                                                
                                                for (int j = offset; j < end; j++) {
                                                        //eliminate special chars, add spaces otherwise the words will be concatenated
                                                        byte b = btoWrite[j];
                                                        // mask to 0..255 so bytes above 127 are not sign-extended
                                                        writer.write(b & 0xFF);
                                                        
                                                        //b = handleSpecialChars(b);
                                                        //textBytes [j - offset] = b;
                                                        //buff.append((char) b);
                                                
                                                }

                                                /*String s = new String(textBytes);
                                                s = handleText(s);
                                                writer.write(s);
                                                */
                                                if(i < (end -1))
                                                        i = end -1;
                                        }
                                        
                                }
                                
                                writer.close();
                                
                                System.out.println("gata");
                                //ps = PropertySetFactory.create(event.getStream());
                        }catch (Exception ex){
                                //System.out.println("No property set stream: \"" + event.getPath() +
                                //      event.getName() + "\"");
                                //System.out.println(ex);
                                return;
                        }
                }

                private byte handleSpecialChars(byte b) {
                        if(b < 32 && b != 10 && b != 9 && b != 13 )
                                b = 32;
                        return b;
                }       
        }



