Hi, Javier:

If it's a smallish set of documents, you can write a loop that reads each 
document and applies a regex to all of the text in the document, but if it is a 
substantial corpus, you should look at enriching the documents to support 
searching for VIN numbers.
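
For the small-corpus case, the loop amounts to something like the sketch below. 
It is written in Python only to show the shape of the scan, not as MarkLogic 
code: the `docs` mapping of URI to text is a hypothetical stand-in for however 
you read the documents. The pattern is an adaptation of yours, with two 
assumptions of mine: the commas are dropped from the character classes (inside 
brackets a comma matches a literal comma), and word boundaries are added so 
that only standalone 17-character tokens match.

```python
import re

# Assumption: the pattern from the question, minus the commas inside the
# character classes, matched case-insensitively and anchored on word boundaries.
VIN_PATTERN = re.compile(
    r"\b[A-HJ-NP-Z0-9]{9}"     # first nine characters: no I or O
    r"[A-HJ-NPR-TV-Z0-9]"      # tenth character: also no Q or U
    r"[A-HJ-NP-Z0-9]"          # eleventh character
    r"[0-9]{6}\b",             # six-digit serial number
    re.IGNORECASE,
)


def uris_with_vins(docs):
    """Return the URIs of documents whose text contains a VIN-like token.

    `docs` is a hypothetical mapping of document URI to document text.
    """
    return [uri for uri, text in docs.items() if VIN_PATTERN.search(text)]


if __name__ == "__main__":
    sample = {
        "/a.txt": "this is my string 2T3JK4DV1AW023473",
        "/b.txt": "no vehicle identifiers here",
    }
    print(uris_with_vins(sample))
```

Every document is read and scanned on every search, which is why this only 
suits a smallish set.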

Searching over a set of values with good performance at scale requires an 
index over those values.

To recognize the values within JSON or XML documents, the indexer looks for a 
specified JSON property or XML element or attribute.

That requires modifying the documents on or after ingestion to identify the VIN 
numbers.  (It's easiest if you can specify a unique JSON property or XML 
element or attribute, but if that's not possible, fields can support unions and 
path range indexes can support containment.)
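
As a rough illustration of the enrichment step, the sketch below walks a parsed 
JSON document, collects anything that looks like a VIN, and records the matches 
under a dedicated property that an index could then be defined over. It is 
Python rather than MarkLogic code, and the property name `vins`, the helper 
names, and the comma-free adaptation of the pattern are all assumptions made 
for the example.

```python
import re

# Assumption: the pattern from the question, minus the commas inside the
# character classes, matched case-insensitively and anchored on word boundaries.
VIN_PATTERN = re.compile(
    r"\b[A-HJ-NP-Z0-9]{9}[A-HJ-NPR-TV-Z0-9][A-HJ-NP-Z0-9][0-9]{6}\b",
    re.IGNORECASE,
)


def iter_strings(value):
    """Yield every string value nested anywhere inside parsed JSON."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for child in value.values():
            yield from iter_strings(child)
    elif isinstance(value, list):
        for child in value:
            yield from iter_strings(child)


def enrich_with_vins(doc, prop="vins"):
    """Return a shallow copy of `doc` with the VINs found in its text under `prop`.

    `prop` is a hypothetical property name; the input document is not modified.
    """
    found = []
    for text in iter_strings(doc):
        for vin in VIN_PATTERN.findall(text):
            normalized = vin.upper()
            if normalized not in found:
                found.append(normalized)
    enriched = dict(doc)
    if found:
        enriched[prop] = found
    return enriched


if __name__ == "__main__":
    record = {"title": "Service record",
              "notes": ["Customer brought in 2t3jk4dv1aw023473 for brakes"]}
    print(enrich_with_vins(record))
```

Once every document carries its VINs in one known property, a search becomes an 
index lookup on that property instead of a regex scan over all of the text.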

Several natural language processors try to solve this kind of enrichment 
problem.  Maybe someone on the list can recommend specific NLP tools for VIN 
recognition based on their experience.


Hoping that helps,


Erik Hennum

________________________________
From: [email protected] 
[[email protected]] on behalf of Javier Lizarraga 
[[email protected]]
Sent: Tuesday, July 14, 2015 5:21 PM
To: [email protected]
Subject: [MarkLogic Dev General] search MarkLogic Database using Regular 
Expressions

Is there a way to issue a search using a regular expression in MarkLogic?

For example, the following regular expression identifies a VIN number:
(([a-h,A-H,j-n,J-N,p-z,P-Z,0-9]{9})([a-h,A-H,j-n,J-N,p,P,r-t,R-T,v-z,V-Z,0-9])([a-h,A-H,j-n,J-N,p-z,P-Z,0-9])(\d{6}))

I would like to issue a query that searches the entire database and returns 
the documents that contain valid VIN numbers.

Similar to fn:matches in MarkLogic, which takes a string and a pattern and 
returns a Boolean value:
fn:matches("this is my string 2T3JK4DV1AW023473" , 
"(([a-h,A-H,j-n,J-N,p-z,P-Z,0-9]{9})([a-h,A-H,j-n,J-N,p,P,r-t,R-T,v-z,V-Z,0-9])([a-h,A-H,j-n,J-N,p-z,P-Z,0-9])(\d{6}))")

I’d like to do something like this:
cts:search("(([a-h,A-H,j-n,J-N,p-z,P-Z,0-9]{9})([a-h,A-H,j-n,J-N,p,P,r-t,R-T,v-z,V-Z,0-9])([a-h,A-H,j-n,J-N,p-z,P-Z,0-9])(\d{6}))")

Any help would be greatly appreciated!!

Javier