Hi Ian,
I just did a test of the speed of removing the invalid chars using brute
force. Here's my code:
    // Needs "using System.Linq;" for Union, Contains and ToArray.
    var invalids = System.IO.Path.GetInvalidPathChars()
        .Union(System.IO.Path.GetInvalidFileNameChars());
    var text = new string('x', 200000);
    // Keep every character that is not in the invalid set.
    var query = from c in text
                where !invalids.Contains(c)
                select c;
    var clean = new string(query.ToArray());
My computer manages to strip the chars from a 139,000 character string in
about a second - timed using System.Diagnostics.Stopwatch. So for many
circumstances I think that a brute force approach is quite workable. What do
you think?
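
A possible refinement of the same brute-force idea, offered as a sketch rather than a measured result: in the snippet above, invalids is a lazy Union sequence, so each Contains call probably re-enumerates it for every character. Copying the characters into a HashSet<char> once should make each lookup cheap. The names PathCleaner and StripInvalidChars are hypothetical, chosen only for illustration, and no timings were taken for this version.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class PathCleaner
{
    // Built once and reused; HashSet<char> gives O(1) average-case lookups.
    private static readonly HashSet<char> Invalids = BuildInvalidSet();

    private static HashSet<char> BuildInvalidSet()
    {
        // Combine both framework-supplied lists; duplicates collapse in the set.
        var set = new HashSet<char>(Path.GetInvalidPathChars());
        set.UnionWith(Path.GetInvalidFileNameChars());
        return set;
    }

    // Hypothetical helper: returns text with every invalid character removed.
    public static string StripInvalidChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (!Invalids.Contains(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

Usage would be along the lines of var clean = PathCleaner.StripInvalidChars(text); and wrapping the call in a Stopwatch, as above, would show whether the change matters on a given machine.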
Cheers.
James.
From: [email protected] [mailto:[email protected]]
On Behalf Of Ian Thomas
Sent: Friday, 26 November 2010 17:22
To: 'ozDotNet'
Subject: One for next week
My regex is very irregular, so some ideas would be nice.
Problem: excluding the prohibited characters from file paths and file names.
I started off thinking that \ / : * ? " < > | would be about the maximum,
and I would just pass the filenames (generated from text titles - eg, books,
videos, etc.) through a simple looping routine looking for the 9 prohibited
characters.
Using a simple regex, Regex.Replace(strIn, "[^\...@-]", "") is too
restrictive - for example, bracketed numbers (1), [23], etc are very common.
I've devoted too long to expanding this without much joy, and would
appreciate help.
In my research, I discovered these two helpful methods in System.IO -
which is why my first approach, comparing characters and arrays, was
abandoned to explore if regular expressions might help.
Path.GetInvalidPathChars() - Get a list of invalid path characters (returns
an array of Char)
and
Path.GetInvalidFileNameChars() - Get a list of invalid filename characters
(returns an array of Char)
The number returned is surprisingly large, so iterating through even a
50-character long filename / path name and checking for the undesirable
characters would take considerably longer than doing the same for 9
characters.
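
One way the two ideas might be combined, sketched here under assumptions rather than offered as a tested solution: build the regex character class from whatever Path.GetInvalidFileNameChars() returns, so the pattern excludes only the prohibited characters and leaves brackets, parentheses and digits alone. Regex.Escape is used so that characters such as \ * ? | are treated literally inside the class. The names FileNameSanitizer and Sanitize are hypothetical, and the set of characters removed depends on the platform the code runs on.

```csharp
using System.IO;
using System.Text.RegularExpressions;

public static class FileNameSanitizer
{
    // Character class made from the framework's own list, escaped so that
    // regex metacharacters in the list are matched literally.
    private static readonly Regex InvalidFileNameChars = new Regex(
        "[" + Regex.Escape(new string(Path.GetInvalidFileNameChars())) + "]");

    // Hypothetical helper: removes every invalid filename character from title.
    public static string Sanitize(string title)
    {
        return InvalidFileNameChars.Replace(title, "");
    }
}
```

A title such as "Title (1) [23]" would pass through unchanged, since none of its characters appear in the invalid list. The regex is constructed once, so the size of the list affects only the one-off setup, not each call.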
_____
Ian Thomas
Victoria Park, Western Australia