[GitHub] [tika] tballison commented on pull request #558: TIKA-1735 - Adding DWGRead parser to Tika if available

GitBox Wed, 18 May 2022 12:30:17 -0700


tballison commented on PR #558:
URL: https://github.com/apache/tika/pull/558#issuecomment-1130425745


   > If I use a BufferedReader I get the correct output, but it's slower: about 10s instead of 3s (it's quite a large file)
   > 
   > ```
   > // Read line by line so that no " nan" token is split across two reads.
   > // try-with-resources closes both the reader and the output stream.
   > try (BufferedReader br = new BufferedReader(new FileReader("c:\\temp1\\dwgreadout.json"));
   >      FileOutputStream fos = new FileOutputStream("c:\\temp1\\dwgreadoutClean.json"))
   > {
   >     String sCurrentLine;
   >     while ((sCurrentLine = br.readLine()) != null)
   >     {
   >         // The tokens are fixed strings, so a literal replace is enough.
   >         sCurrentLine = sCurrentLine
   >                 .replace(" nan ", " 0 ")
   >                 .replace(" nan,", " 0,") + "\n";
   >         // Encode once and reuse the array for the write.
   >         byte[] lineBytes = sCurrentLine.getBytes();
   >         fos.write(lineBytes, 0, lineBytes.length);
   >     }
   > }
   > ```
   
   I agree that it's slower, but it is more likely to be correct.  I don't 
understand the problem as thoroughly as you do, but running regexes against one 
part of a file at a time will miss any match that straddles the break between 
two parts, so whether it works depends on where those breaks happen to fall.
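
To make the boundary problem concrete, here is a minimal sketch. The class `NanChunkDemo`, its method names, and the sample input are all hypothetical and are not taken from the PR. It applies the same two replacements once per fixed-size chunk and once per complete line. With a chunk size of 8, the token ` nan,` in the sample string is split across two chunks, so the chunked pass leaves it untouched.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class; names and input are illustrative only.
final class NanChunkDemo {

    // The replacements from the quoted snippet, applied to one piece of text.
    static String fix(String s) {
        return s.replace(" nan ", " 0 ").replace(" nan,", " 0,");
    }

    // Applies fix() to each fixed-size chunk independently, which mimics
    // running the replacement against part of a file at a time.
    static String replaceInChunks(String input, int chunkSize) {
        StringBuilder out = new StringBuilder();
        for (int start = 0; start < input.length(); start += chunkSize) {
            int end = Math.min(start + chunkSize, input.length());
            out.append(fix(input.substring(start, end)));
        }
        return out.toString();
    }

    // Applies fix() to one complete line at a time, as BufferedReader does.
    static String replaceByLine(String input) {
        StringBuilder out = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new StringReader(input))) {
            String line;
            while ((line = br.readLine()) != null) {
                out.append(fix(line)).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String json = "[ 1.5, nan, 2.0 ]\n";
        // Chunks of 8 chars are "[ 1.5, n" and "an, 2.0 ", so " nan," is split.
        System.out.print(replaceInChunks(json, 8));
        System.out.print(replaceByLine(json));
    }
}
```

The line-based pass is safe here only because the tokens never contain a line break. A chunked approach could be made correct by carrying the tail of each chunk over into the next one, at the cost of extra bookkeeping.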


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
