Hi guys,

I am analyzing wiki dumps and have run into a frustrating problem: some of
the text (<text> under <revision>) seems to be malicious. It may contain
only a single obscene word, repeated over and over. Worse, some of these
strings seem to be endless, which causes my parser to hang while reading
them. I extracted one such text and opened it in vim, which reports a
definite number of lines. But when I page down, I never reach the end and
get stuck in seemingly endless garbled text.

Has anyone run into this problem before? Thanks a lot.

Best regards,

-- 
Ning Zhang
Purdue University
E-mail:[email protected]
Cell Phone:765-337-6629
_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
