For kicks, I just wrote up the regex equivalent -

            // Needs: using System.Linq; using System.Text;
            var sb = new StringBuilder();
            var invalids = System.IO.Path.GetInvalidPathChars()
                .Union(System.IO.Path.GetInvalidFileNameChars());

            sb.Append("[");
            foreach (var c in invalids)
            {
                // Escape every char; no harm if escaping is not required.
                sb.Append(@"\");
                sb.Append(c);
            }
            sb.Append("]");

            var re = new System.Text.RegularExpressions.Regex(sb.ToString());

            var text = new string('x', 200000);

            var clean = re.Replace(text, "");

It averages around 15 ms (including building and compiling the regex
each time), as opposed to ~1000 ms for brute forcing.
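
For anyone wanting to repeat the timing, here is a minimal sketch of a Stopwatch-based harness. The Timing class and AverageMs helper are hypothetical names of mine, not anything from the framework, and the warm-up run is my own choice, so numbers from it will not match the figures above exactly.

```csharp
using System;
using System.Diagnostics;

static class Timing
{
    // Hypothetical helper: average wall-clock milliseconds of `work`
    // over `runs` repetitions, measured with System.Diagnostics.Stopwatch.
    public static double AverageMs(Action work, int runs)
    {
        if (runs <= 0)
            throw new ArgumentOutOfRangeException("runs");

        work(); // warm-up run, so one-off JIT cost is not counted

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++)
            work();
        sw.Stop();

        return sw.Elapsed.TotalMilliseconds / runs;
    }
}
```

Usage would be something like Timing.AverageMs(() => re.Replace(text, ""), 10), with the regex construction moved inside the lambda if building it should be counted too.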

And just to show it works, replace var text with this, and the output
should be "startend".

            var sb2 = new StringBuilder();
            sb2.Append("start");
            foreach (var c in invalids)
                sb2.Append(c);
            sb2.Append("end");
            var text = sb2.ToString();

So brute forcing might be easier, but regex is soooo much faster :)
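
One caveat on that comparison: in the brute force query, invalids is a deferred Union sequence, so invalids.Contains(c) re-enumerates the union for every character of the input. I suspect that accounts for a good part of the gap. Below is a rough sketch of a brute force variant that builds a HashSet<char> once. PathCleaner and Strip are hypothetical names, and I have not timed this one.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

static class PathCleaner
{
    // Build the set of invalid chars once; HashSet<char>.Contains is a
    // constant-time lookup instead of a walk over the whole sequence.
    private static readonly HashSet<char> Invalids = new HashSet<char>(
        Path.GetInvalidPathChars().Concat(Path.GetInvalidFileNameChars()));

    // Returns `text` with every invalid path or file name char removed.
    public static string Strip(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (var c in text)
        {
            if (!Invalids.Contains(c))
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

The same check as before applies: PathCleaner.Strip("start" + new string(Path.GetInvalidFileNameChars()) + "end") should give back "startend".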

On 26 November 2010 18:32, Ian Thomas <[email protected]> wrote:
> James
>
> Yes, I just did much the same myself, assuming it might be a wait of a few
> seconds - similar result, less than a second. Brute force is less prone to
> making my brain hurt than regular expressions, too!
>
> Thanks for the input.
>
> ________________________________
>
> Ian Thomas
> Victoria Park, Western Australia
>
> ________________________________
>
> From: [email protected] [mailto:[email protected]]
> On Behalf Of James Chapman-Smith
> Sent: Friday, November 26, 2010 3:19 PM
> To: 'ozDotNet'
> Subject: RE: One for next week
>
>
>
> Hi Ian,
>
>
>
> I just did a test of the speed of removing the invalid chars using brute
> force. Here's my code:
>
> var invalids = System.IO.Path.GetInvalidPathChars()
>      .Union(System.IO.Path.GetInvalidFileNameChars());
>
> var text = new string('x', 200000);
>
> var query = from c in text
>                 where !invalids.Contains(c)
>                 select c;
>
> var clean = new string(query.ToArray());
>
> My computer manages to strip the chars from a 139,000 character string in
> about a second - timed using System.Diagnostics.Stopwatch.  So for many
> circumstances I think that a brute force approach is quite workable. What do
> you think?
>
>
>
> Cheers.
>
>
>
> James.
>
>
>
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Ian Thomas
> Sent: Friday, 26 November 2010 17:22
> To: 'ozDotNet'
> Subject: One for next week
>
>
>
> My regex is very irregular, so some ideas would be nice
>
> Problem: excluding the prohibited characters from file paths and file names.
> I started off thinking that \ / : * ? " < > | would be about the maximum,
> and I would just pass the filenames (generated from text titles - eg, books,
> videos, etc) through a simple looping routine looking for the 9 prohibited
> characters.
>
> Using a simple regex, Regex.Replace(strIn, "[^\...@-]", "") is too
> restrictive - for example, bracketed numbers (1), [23], etc are very common.
> I've devoted too long to expanding this without much joy, and would
> appreciate help.
>
> In my researches, I discovered these two helpful methods in System.IO -
> which is why my first approach, comparing characters and arrays, was
> abandoned to explore if regular expressions might help.
>
> Path.GetInvalidPathChars() - Get a list of invalid path characters (returns
> an array of Char)
>
> and
>
> Path.GetInvalidFileNameChars() - Get a list of invalid filename characters
> (returns an array of Char)
>
> The number returned is surprisingly large, so iterating through even a
> 50-character long filename / path name and checking for the undesirable
> characters would be considerably longer than doing the same for 9
> characters.
>
> ________________________________
>
> Ian Thomas
> Victoria Park, Western Australia
>
> ________________________________
