Taking a "big picture" approach:

  * If you are going to automate extracting data, you must know the rules that 
define the data "fields" you want to extract, or else no tool will do it for 
you.
  * Most tools have some regular expression (RE) capability (not just Perl), 
but is an RE the answer to **how do I identify each field?** If you can 
delineate fields without using an RE, that may make your code faster, easier 
to modify, or more readable (but it also may not); see the first sketch after 
this list.
  * Are these files small enough that you can read the whole file into memory 
and process a single string of data, or do you need to process the file 
iteratively (by line, or by chunk of data) to save on memory? The second 
sketch below shows both approaches.
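
As a sketch of the no-RE option above, assuming hypothetical colon-delimited 
input, `strutils.split` can delineate fields without any regular expression:

```nim
import strutils

# Hypothetical input: colon-delimited records.
let line = "user:alice:1001"

# split delineates the fields with no RE involved.
let fields = line.split(':')
echo fields[0]   # "user"
echo fields[1]   # "alice"
echo fields[2]   # "1001"
```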
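
To illustrate the memory trade-off, here are both approaches (the file name is 
a placeholder): `readFile` pulls the whole file into one string, while the 
`lines` iterator holds only the current line in memory.

```nim
# Whole file at once: memory use is proportional to file size.
let whole = readFile("data.txt")   # "data.txt" is a placeholder
echo whole.len, " bytes read in one go"

# Iterative: only the current line is held in memory.
for line in lines("data.txt"):
  discard line  # per-line field extraction would go here
```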

If you end up using regular expressions in Nim, cheat by leveraging what is 
already done in nimgrep. It is easy to end up with an RE that is slow in Nim, 
so if speed is an issue, make sure you benchmark it. (The RE isn't slow 
because it uses PCRE; the reason, I believe, is that it is easy for a newbie 
to write code with lots of string allocations that slow it down.)
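
As a minimal sketch of the allocation point, assuming PCRE is available: the 
stdlib `re` module's `findBounds` returns indices into the existing string, so 
you only allocate a substring when you actually need one, whereas `findAll` 
builds a new string per match.

```nim
import re

let data = "id=42 name=alice id=99 name=bob"
let pat = re"id=(\d+)"

# findAll allocates a fresh string for every match ...
for m in data.findAll(pat):
  echo m

# ... while findBounds only returns index positions, so the slice
# (and its allocation) happens only when the text is actually needed.
var start = 0
while true:
  let bounds = data.findBounds(pat, start)
  if bounds.first < 0: break
  echo data[bounds.first .. bounds.last]
  start = bounds.last + 1
```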

The following probably applies to more structured data, but I'll mention it 
for completeness: @Araq posted about how [a tool like 
sqlite](https://forum.nim-lang.org/t/2925#18403) can be a good choice if you 
are then manipulating the extracted data.
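
A minimal sketch of that idea using the stdlib `db_sqlite` wrapper (the file 
and table names here are made up, and the sqlite library must be installed): 
once the extracted fields are in a table, the later manipulation happens in 
SQL rather than in hand-written Nim.

```nim
import db_sqlite

let db = open("extracted.db", "", "", "")
db.exec(sql"CREATE TABLE IF NOT EXISTS fields (name TEXT, value TEXT)")

# Store each field as it is extracted.
db.exec(sql"INSERT INTO fields (name, value) VALUES (?, ?)", "id", "42")

# Manipulate the data with SQL instead of hand-rolled loops.
for row in db.fastRows(sql"SELECT name, value FROM fields"):
  echo row[0], " = ", row[1]

db.close()
```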
