monkmachine commented on PR #558:
URL: https://github.com/apache/tika/pull/558#issuecomment-1128022252

   > > > @nddipiazza @tballison This looks messy, can you advise a way to clean
   > > > it up? A better way of doing it? Still think it's worth having the comments
   > > > there?
   > > 
   > > 
   > > OMG, what a mess. The output, not you.
   > > What I've done before is a regex pattern+matcher that captures the
   > > escape sequence first OR then the controls ~/(\)|([A-Z0-9]{1,5})/, capture
   > > group(2) (and skip it), append group 1 to tail.
   > > That's a rough answer and probably wrong, but see what you can do.
   > > The braces...hmmmm... Maybe take a second pass and do the same thing?
   > > You can't just add this in the OR ~/{[^}]{0,50}}/ because that won't correctly
   > > process an escaped } within the braces.
   > 
   > I threw together a somewhat working example. I think there are still some
   > things I'm missing:
   > https://github.com/tballison/tika-addons/blob/main/DWGReadDev/src/test/java/TestRegexCleaners.java
   > 
   > Obv, we'll want to make the patterns static, etc.
   
   Will take a look @tballison, thanks for your help. I've been cleaning up
the code to match the checkstyle rules (which I only learnt about today) and
testing my janky regexes (in their current form) against some documents I have.
Like I said, I managed to build Tika Server and check that the config was
working correctly, so it's been a successful few hours today :) I will take a
look at your example tomorrow and hopefully find some time this week to check
the stop method on the other pull request. We can then look at creating a
guide/script on how to install Tika Server as a Windows service using Daemon.
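
For reference, here is a rough sketch of the escape-first alternation described in the quoted suggestion, not the code from the linked test class. The regex as it appears in the quote looks like it lost some backslashes, so the pattern below is an assumption: group 1 is taken to be the character after an escaping backslash (`\\`, `\{` or `\}`), and group 2 is taken to be a control code made of a backslash plus one to five uppercase letters or digits. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names and the exact pattern are assumptions, not Tika API.
final class EscapeAwareCleaner {

    // Group 1: the literal character following an escaping backslash (\, { or }).
    // Group 2: a control code, assumed to be a backslash plus 1-5 chars of [A-Z0-9].
    // The escape alternative comes first, so "\\P" is read as an escaped
    // backslash followed by a plain "P" rather than as the control code \P.
    private static final Pattern ESCAPE_OR_CONTROL =
            Pattern.compile("\\\\([\\\\{}])|(\\\\[A-Z0-9]{1,5})");

    private EscapeAwareCleaner() {
    }

    static String clean(String input) {
        Matcher m = ESCAPE_OR_CONTROL.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Keep the escaped literal (group 1); drop control codes (group 2).
            String replacement = m.group(1) != null ? m.group(1) : "";
            // quoteReplacement stops a kept backslash from being treated
            // as an escape in the replacement string.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

Calling `EscapeAwareCleaner.clean(...)` on text containing `\P` drops the control code, while `\{` and `\}` come back as plain braces. The sketch leaves unescaped grouping braces alone, and the `{1,5}` quantifier will also swallow uppercase text that directly follows a control code, so it would still need checking against real documents.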


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
