Rainer Klute wrote:
> On Wed, 2004-06-23 at 14:22, Sergiu Gordea wrote:
>> Thanks for advice, but if you are talking about the code listed below, I
>> can tell you it is not working:
>
> Before looking at the code in detail: What do you mean by "is not
> working"?
Hi Rainer,
I revised the code and modified it a little.
One problem was that the FileOutputStream was not closed, so nothing was
written to the file.
The second was that I was not printing the array of bytes correctly.
Anyway, some problems are not solved yet:
1. I used this in my code:
if(!event.getName().equalsIgnoreCase("PowerPoint Document"))
return;
because I want to read only the text in the document, and
I'm not interested in the other elements:
drawings, images, headers etc.
I don't like that hard-coded "PowerPoint Document". Does anyone
know how I can replace this condition with something more general?
2. I have a master template and therefore (I think) I get something
like this in the resulting text:
Klicken Sie, um die Formate des Vorlagentextes zu bearbeiten
Zweite Ebene djflk sdlkfj sdlfkj sdlkfj sdlkfj sdlfkjs dflskjdf slkjd fslk
Dritte Ebene djflk sdlkfj sdlfkj sdlkfj sdlkfj sdlfkjs dflskjdf slkjd
fslk h fkjh kjh dskjh dkjh
Vierte Ebene
Detail 1
Detail 2
Detail 3 Titel INFORMATICS
DEPARTMENTS University of Klagenfurt Klicken Sie, um das Titelformat
zu bearbeiten Klicken Sie, um das Untertitelformat zu bearbeiten
Textmasterformate durch Klicken bearbeiten
Zweite Ebene
Dritte Ebene
Vierte Ebene
3. Special characters are not extracted correctly, and CR characters
should also be removed from the resulting text:
� (f�nfte)
F?nfte Ebene PROM Server PROM Plug-in
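One possible direction for problem 3, as a minimal sketch only. It assumes that the type 4008 records hold one byte per character and that each byte can be read as ISO-8859-1, and that the "?" appears because a cast like (char) b sign-extends bytes above 0x7F (0xFC becomes U+FFFC instead of "ü"). PptTextCleaner and its methods are hypothetical names and not part of POI.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not POI API): decodes and cleans text bytes from a type 4008 record. */
final class PptTextCleaner {

    /** Decodes one byte per character, reading every byte as an unsigned ISO-8859-1 value. */
    static String decodeBytes(byte[] data, int offset, int length) {
        return new String(data, offset, length, StandardCharsets.ISO_8859_1);
    }

    /** Replaces CR and other control characters with a space; keeps LF and TAB. */
    static String cleanControlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 32 && c != '\n' && c != '\t') {
                out.append(' '); // a space keeps neighbouring words apart
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] raw = {'F', (byte) 0xFC, 'n', 'f', 't', 'e', 13, 'E', 'b', 'e', 'n', 'e'};
        System.out.println(cleanControlChars(decodeBytes(raw, 0, raw.length)));
    }
}
```

The output file is a second place where characters can get lost: FileWriter uses the platform default encoding, so wrapping a FileOutputStream in an OutputStreamWriter with an explicit charset such as UTF-8 may be safer.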
I'm not at all familiar with the format of MS Office documents and I
have only just downloaded the POI library, but I would like to
create some classes to extract the text from Word, PPT and XLS documents.
I got an example and have already created a class to get the text from XLS
documents.
Is anyone interested in helping me create these text extractors, so that
we can provide some
helper classes that can be used together with Lucene to index Office
documents?
Thanks for understanding,
Sergiu Gordea
PS: Please excuse the mess in the code in the attached source; that's
just a test class.
Best regards
Rainer Klute
Rainer Klute IT-Consulting GmbH
Dipl.-Inform.
Rainer Klute E-Mail: [EMAIL PROTECTED]
Körner Grund 24 Telefon: +49 172 2324824
D-44143 Dortmund Telefax: +49 231 5349423
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Thu Jun 24 13:52:57 CEST 2004 Knowledge Management in Software
Engineering Automatically Knowledge Acquisition about
Software Development Process
Sergiu Gordea Klagenfurt 17. May.
2004 Data, Information, Knowledge Data:
- consists of discrete, objective facts about events
- essential raw material for the creation of information
Information:
- a message, usually in the form of document or in audible/visible communication; it
has a sender and a receiver
- data + meaning
Knowledge:
- is broader than information, it is information combined with experience,
interpretation and reflection
- should contain information about information, like classification and meta-data
(creator, when and where, context etc. )
Knowledge Life-Cycle Problems of KM in SE Tacit Knowledge:
expert networks
expensive
requires friendly environment for knowledge sharing
lack of motivation in Software Development Companies (SDC)
Explicit Knowledge:
the creation of explicit knowledge is not on the list of priorities in SDC
each SDC manages their knowledge somehow
it is hard to prove the outcome obtained from the usage of Knowledge Management System
Solution: Automatically Knowledge Acquisition ! Knowledge about SDP
current status of the development process
post mortem analysis of the projects
the quality of developed software
new information and artifacts created during the development process Measures of
SDP System Architecture Fig.2 System Architecture
System Modules
System Modules Prom Trace:
registers the time spent and the title of the active window
WebMetrics:
WebMetrics is an extensible tool able to extract code metrics from source code files
Code metrics: Lines Of Code (LOC), McCabe Cyclomatic Complexity, Halstead Volume,
Fan-In and Fan-Out
OOM: Weighted Methods per Class (WMC), Depth of Inheritance Tree (DIT), Number Of
Children (NOC), Coupling Between Object Classes (CBO), Response For a Class (RFC),
Lack Of Cohesion in Methods (LCOM)
Open Issues
discover and eliminate the noise from the data collected by PROM Tool
find correlation between the collected data and the software metrics
draw conclusions and generate knowledge automatically (or semi-automatically)
complete the set of collected metrics with other relevant metrics (ex. Design &
Quality)
discover/predict modules with high fault density
discover modules with bad design
localize the parts of code with bad design Klicken Sie, um die Formate des
Vorlagentextes zu bearbeiten
Zweite Ebene djflk sdlkfj sdlfkj sdlkfj sdlkfj sdlfkjs dflskjdf slkjd fslk
Dritte Ebene djflk sdlkfj sdlfkj sdlkfj sdlkfj sdlfkjs dflskjdf slkjd fslk h fkjh kjh
dskjh dkjh
Vierte Ebene
Detail 1
Detail 2
Detail 3 Titel INFORMATICS
DEPARTMENTS University of Klagenfurt Klicken Sie, um das Titelformat zu
bearbeiten Klicken Sie, um das Untertitelformat zu bearbeiten Textmasterformate
durch Klicken bearbeiten
Zweite Ebene
Dritte Ebene
Vierte Ebene
F?nfte Ebene PROM Server PROM Plug-in PROM Trace Web Metrics Report
Generation Prom Plug-in:
registers the time spent by developers in each method
Available for JBuilder, Eclipse, IDEA, MS Office, OpenOffice
/* @(#) CWK 1.5 21.06.2004
*
* Copyright 2003-2005 ConfigWorks Informationssysteme & Consulting GmbH
* Universitätsstr. 94/7 9020 Klagenfurt Austria
* www.configworks.com
* All rights reserved.
*/
package com.configworks.cwk.be.search.converters;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Date;
import org.apache.poi.hpsf.PropertySet;
import org.apache.poi.hpsf.PropertySetFactory;
import org.apache.poi.poifs.eventfilesystem.POIFSReader;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderEvent;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderListener;
import org.apache.poi.util.LittleEndian;
public class Ppt2Txt
{
public static void main(String[] args)
throws IOException{
//String filename = args[0];
File src = new File("e:\\test.ppt");
File dest = new File("e:\\ppt.txt");
convertPpt(src, dest);
}
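public static void convertPpt(File src, File dest) throws IOException {
POIFSReader r = new POIFSReader();
/* Register a listener for *all* documents; the listener itself skips
every stream except "PowerPoint Document". */
BufferedWriter writer = new BufferedWriter(new FileWriter(dest));
FileInputStream in = new FileInputStream(src);
try {
r.registerListener(new MyPOIFSReaderListener(writer));
r.read(in);
} finally {
// Close both streams even if the listener never ran or reading failed.
in.close();
writer.close();
}
}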
}
class MyPOIFSReaderListener implements POIFSReaderListener{
private BufferedWriter writer = null;
public MyPOIFSReaderListener(BufferedWriter writer){
this.writer = writer;
}
public void processPOIFSReaderEvent(POIFSReaderEvent event) {
PropertySet ps = null;
try{
org.apache.poi.poifs.filesystem.DocumentInputStream dis=null;
if(!event.getName().equalsIgnoreCase("PowerPoint Document"))
return;
System.out.println("\n\n");
System.out.println(event.getPath()+event.getName());
dis=event.getStream();
byte btoWrite[]= new byte[12];
dis.read(btoWrite);
//System.out.println("Version:"+LittleEndian.getUnsignedByte(btoWrite,0));
//System.out.println("Instance:"+LittleEndian.getUShort(btoWrite,0));
//System.out.println("Type:"+LittleEndian.getUShort(btoWrite,2));
//System.out.println("Len:"+LittleEndian.getLong(btoWrite,4));
btoWrite = new byte[dis.available()];
dis.read(btoWrite, 0, dis.available());
String date = (new Date()).toString();
System.out.println(date);
writer.write(date);
StringBuffer buff = new StringBuffer("");
for(int i=0; i<btoWrite.length-20; i++){
//System.out.println("Version :"+LittleEndian.getUnsignedByte(btoWrite,i+0));
//System.out.println("Instance :"+LittleEndian.getUShort(btoWrite,i+0));
//System.out.println("Type:"+LittleEndian.getUShort(btoWrite,i+2));
//System.out.println("Len:"+LittleEndian.getUInt(btoWrite,i+4));
long type=LittleEndian.getUShort(btoWrite,i+2);
long size=LittleEndian.getUInt(btoWrite,i+4);
if (type==4008){
// The record header is 8 bytes (2 version/instance, 2 type, 4 length),
// so the text starts at i+8 and is exactly 'size' bytes long.
int offset = i+8;
int length = (int)size;
int end = offset + length;
byte[] textBytes = new byte[length];
for (int j = offset; j < end; j++) {
//eliminate special chars, add spaces, otherwise the words will be concatenated
byte b = btoWrite[j];
writer.write((char) b);
//b = handleSpecialChars(b);
//textBytes [j - offset] = b;
//buff.append((char) b);
}
/*String s = new String(textBytes);
s = handleText(s);
writer.write(s);
*/
if(i < (end -1))
i = end -1;
}
}
writer.close();
System.out.println("gata");
//ps = PropertySetFactory.create(event.getStream());
}catch (Exception ex){
//System.out.println("No property set stream: \"" + event.getPath() +
// event.getName() + "\"");
//System.out.println(ex);
return;
}
}
private byte handleSpecialChars(byte b) {
if(b < 32 && b != 10 && b != 9 && b != 13 )
b = 32;
return b;
}
}
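The listener above searches every byte offset for the value 4008, which can also match inside unrelated data. A hedged alternative is to walk the stream record by record. The sketch below assumes the usual 8-byte record header (2 bytes version/instance, 2 bytes type, 4 bytes length, all little-endian), that a version nibble of 0xF marks a container whose body holds further records, and that type 4000 holds UTF-16LE text while type 4008 holds one byte per character. PptRecordWalker and its methods are hypothetical names, not POI classes; it expects the full content of the "PowerPoint Document" stream as a byte array.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not POI API): walks PowerPoint records and collects text atoms. */
final class PptRecordWalker {
    static final int TEXT_CHARS_ATOM = 4000; // assumed: UTF-16LE text
    static final int TEXT_BYTES_ATOM = 4008; // assumed: one byte per character

    static int uShort(byte[] b, int off) {
        return (b[off] & 0xFF) | ((b[off + 1] & 0xFF) << 8);
    }

    static long uInt(byte[] b, int off) {
        return uShort(b, off) | ((long) uShort(b, off + 2) << 16);
    }

    /** Returns the text of every text atom found, in stream order. */
    static List<String> extractText(byte[] stream) {
        List<String> out = new ArrayList<>();
        walk(stream, 0, stream.length, out);
        return out;
    }

    private static void walk(byte[] b, int pos, int end, List<String> out) {
        while (pos + 8 <= end) {
            int verInst = uShort(b, pos);
            int type = uShort(b, pos + 2);
            long len = uInt(b, pos + 4);
            int body = pos + 8;
            if (len > end - body) {
                return; // truncated or corrupt record: stop instead of guessing
            }
            int bodyEnd = body + (int) len;
            if ((verInst & 0x0F) == 0x0F) {
                walk(b, body, bodyEnd, out); // container: its body is a list of records
            } else if (type == TEXT_BYTES_ATOM) {
                out.add(new String(b, body, (int) len, StandardCharsets.ISO_8859_1));
            } else if (type == TEXT_CHARS_ATOM) {
                out.add(new String(b, body, (int) len, StandardCharsets.UTF_16LE));
            }
            pos = bodyEnd; // jump to the next sibling record
        }
    }
}
```

Because the walker knows which container each text atom sits in, it could later be extended to skip whole containers, for example the one holding the master template placeholders, once its record type is confirmed.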